Abstract

Application of non-linear kernel support vector machines (SVMs) to large datasets is
seriously hampered by excessive training time. We propose a modification, called the
approximate extreme points support vector machine (AESVM), that is aimed at overcoming
this burden. Our approach relies on conducting the SVM optimization over a carefully
selected subset, called the representative set, of the training dataset. We present analytical
results that indicate the similarity of AESVM and SVM solutions. A linear time algorithm
based on convex hulls and extreme points is used to compute the representative set in
kernel space. Extensive computational experiments on nine datasets compared AESVM
to LIBSVM (Chang and Lin, 2001b), LASVM (Bordes et al., 2005), CVM (Tsang et al., 2005),
BVM (Tsang et al., 2007), SVMperf (Joachims and Yu, 2009), and the random
features method (Rahimi and Recht, 2007). Our AESVM implementation was found to
train much faster than the other methods, while its classification accuracy was similar to
that of LIBSVM in all cases. In particular, for a seizure detection dataset, AESVM training 
was almost 10'^ times faster than LIBSVM and LASVM and more than forty times faster 
than CVM and BVM. AESVM also gave competitively fast classification
times.

Keywords: support vector machines, convex hulls, large scale classification, non-linear 
kernels, extreme points 



1. Introduction 

Several real world applications require solutions of classification problems on large datasets.
Even though SVMs are known to give excellent classification results, their application to
problems with large datasets is impeded by the burdensome training time requirements.
Recently, much progress has been made in the design of fast training algorithms (Fan et al.,
2008; Shalev-Shwartz et al., 2011) for SVMs with the linear kernel (linear SVMs). However,
many applications require SVMs with non-linear kernels for accurate classification. Training
time complexity for SVMs with non-linear kernels is typically quadratic in the size of the
training dataset (Shalev-Shwartz and Srebro, 2008). The difficulty of the long training
time is exacerbated when grid search with cross-validation is used to derive the optimal
hyper-parameters, since this requires multiple SVM training runs. Another problem that
sometimes restricts the applicability of SVMs is the long classification time. The time
complexity of SVM classification is linear in the number of support vectors, and in some
applications the number of support vectors is found to be very large (Guo et al., 2005).



In this paper, we propose a new approach for fast SVM training. Consider a two class
dataset of N data vectors, $X = \{x_i : x_i \in \mathbb{R}^D, i = 1, 2, ..., N\}$, and the corresponding target
labels $Y = \{y_i : y_i \in \{-1, 1\}, i = 1, 2, ..., N\}$. The SVM primal problem can be represented
as the following unconstrained optimization problem (Teo et al., 2010; Shalev-Shwartz et al.,
2011):

$$\min_{w,b} F_1(w, b) = \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} l(w, b, \phi(x_i)) \tag{1}$$

where $l(w, b, \phi(x_i)) = \max\{0, 1 - y_i(w^T\phi(x_i) + b)\}$, $\forall x_i \in X$,
and $\phi : \mathbb{R}^D \to H$, $b \in \mathbb{R}$, and $w \in H$, a Hilbert space.

Here $l(w, b, \phi(x_i))$ is the hinge loss of $x_i$. Note that SVM formulations where the penalty
parameter C is divided by N have been used extensively (Schölkopf et al., 2000; Franc and
Sonnenburg, 2008; Joachims and Yu, 2009). These formulations enable better analysis of
the scaling of C with N (Joachims, 2006). The problem in (1) requires optimization over
N variables. In general, for SVM training algorithms, the training time will reduce if the
size of the training dataset is reduced.
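As an illustration of (1), the following minimal sketch evaluates the primal objective for the special case of a linear kernel, that is, assuming $\phi$ is the identity map. The function names and the toy data are hypothetical and chosen only for this example; they are not part of any solver discussed in this paper.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss max{0, 1 - y (w.x + b)} of a single data vector."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def svm_primal_objective(w, b, X, Y, C):
    """F_1(w, b) = 0.5 ||w||^2 + (C / N) * sum_i hinge_loss_i (linear kernel assumed)."""
    N = len(X)
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_loss = sum(hinge_loss(w, b, x, y) for x, y in zip(X, Y))
    return regularizer + (C / N) * total_loss


# Toy data (illustrative only): only the second vector violates the margin.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
Y = [1, 1, -1]
value = svm_primal_objective([1.0, 0.0], 0.0, X, Y, C=3.0)
```

The sum over all N hinge losses is the part of the objective whose cost grows with the dataset size, which is what motivates replacing it with a sum over a smaller set.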

In this paper, we present an alternative to (1), called approximate extreme points support
vector machines (AESVM), that requires optimization over only a subset of the training
dataset. The AESVM formulation is:

$$\min_{w,b} F_2(w, b) = \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{t=1}^{M} \beta_t\, l(w, b, \phi(x_t)) \tag{2}$$

where $x_t \in X^*$, $w \in H$, and $b \in \mathbb{R}$.

Here M is the number of vectors in the selected subset of X, called the representative set
$X^*$. The constants $\beta_t$ are defined in (10). We will prove in Section 3.2 that:

• $F_1(w_1^*, b_1^*) - F_2(w_2^*, b_2^*) \le C\sqrt{C\epsilon}$, where $(w_1^*, b_1^*)$ and $(w_2^*, b_2^*)$ are the solutions of (1)
and (2) respectively

• Under the assumptions given in Corollary 4, $F_1(w_2^*, b_2^*) - F_1(w_1^*, b_1^*) \le 2C\sqrt{C\epsilon}$

• The AESVM problem minimizes an upper bound of a low rank Gram matrix approximation
of the SVM objective function

Based on these results we claim that solving the problem in (2) yields a solution close
to that of (1). As a by-product of the reduction in size of the training set, AESVM is also
observed to result in fast classification. Considering that the representative set will have
to be computed several times if grid search is used to find the optimum hyper-parameter
combination, we also propose fast algorithms to compute $Z^*$. In particular, we present
an algorithm of time complexity $O(N)$ and an alternative algorithm of time complexity
$O(N \log_2 \frac{N}{P})$ to compute $Z^*$, where P is a predefined large integer.

The main contributions of this work can be summarized as follows: 



• Theoretical: Theorems 1 and 2, and Corollaries 3 to 5 give rationale for the use of 
AESVM as a computationally less demanding alternative to the SVM formulation. 

• Algorithmic: The algorithm DeriveRS, described in Section 4, computes the representative
set in linear time.

• Experimental: Our extensive experiments on nine datasets of varying characteristics
illustrate the suitability of applying AESVM to classification on large datasets.

This paper is organized as follows: in Section 2, we briefly discuss recent research on
fast SVM training that is closely related to this work. In Section 3, we provide the definition of
the representative set and discuss properties of AESVM. In Section 4, we present efficient
algorithms to compute the representative set and analyze their computational complexity.
Section 5 describes the results of our computational experiments. We compared AESVM
to the widely used LIBSVM library, core vector machines (CVM), ball vector machines
(BVM), LASVM, SVMperf, and the random features method by Rahimi and Recht (2007).
Our experiments used eight publicly available datasets and a dataset on EEG from an
animal model of epilepsy (Talathi et al., 2008; Nandan et al., 2010). We conclude with a
discussion of the results of this paper in Section 6.



2. Related Work 



Several methods have been proposed to efficiently solve the SVM optimization problem.
SVMs require special algorithms, as standard optimization algorithms such as interior point
methods (Boyd and Vandenberghe, 2004; Shalev-Shwartz et al., 2011) have large memory
and training time requirements that make them infeasible for large datasets. In the following
sections we discuss the most widely used strategies to solve the SVM optimization problem.
We present a comparison of some of these methods to AESVM in Section 6. SVM solvers
can be broadly divided into two categories as described below.



2.1 Dual optimization 



The SVM primal problem is a convex optimization problem with strong duality (Boyd and
Vandenberghe, 2004). Hence its solution can be arrived at by solving its dual formulation
given below:

$$\max_{\alpha} L_1(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \tag{3}$$

subject to $0 \le \alpha_i \le \frac{C}{N}$ and $\sum_{i=1}^{N}\alpha_i y_i = 0$

Here $K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$ is the kernel product (Schölkopf and Smola, 2001) of the
data vectors $x_i$ and $x_j$, and $\alpha$ is a vector of all variables $\alpha_i$. Solving the dual problem is
computationally simpler, especially for non-linear kernels, and a majority of the SVM solvers
use dual optimization. Some of the major dual optimization algorithms are discussed below.
Decomposition methods (Osuna et al., 1997) have been widely used to solve (3). These
methods optimize over a subset of the training dataset, called the 'working set', at each
algorithm iteration. SVMlight (Joachims, 1999) and SMO (Platt, 1999) are popular examples
of decomposition methods. Both these methods have a quadratic time complexity for linear
and non-linear SVM kernels (Shalev-Shwartz and Srebro, 2008). Heuristics such as shrinking
and caching (Joachims, 1999) enable fast convergence of decomposition methods and
reduce their memory requirements. LIBSVM (Chang and Lin, 2001b) is a very popular
implementation of SMO. A dual coordinate descent (Hsieh et al., 2008) SVM solver computes
the optimal $\alpha$ value by modifying one variable $\alpha_i$ per algorithm iteration. Dual coordinate
descent SVM solvers, such as LIBLINEAR (Fan et al., 2008), have been proposed primarily
for the linear kernel.



Approximations of the Gram matrix (Fine and Scheinberg, 2002; Drineas and Mahoney,
2005) have been proposed to increase training speed and reduce memory requirements of
SVM solvers. The Gram matrix is the N×N square matrix composed of the kernel products
$K(x_i, x_j)$, $\forall x_i, x_j \in X$. Training set selection methods attempt to reduce the SVM training
time by optimizing over a selected subset of the training set. Several distinct approaches
have been used to select the subset. Some methods use clustering based approaches (Pavlov
et al., 2000) to select the subsets. In Yu et al. (2003), hierarchical clustering is performed
to derive a dataset that has more data vectors near the classification boundary than away
from it. Minimum enclosing ball clustering is used in Cervantes et al. (2008) to remove data
vectors that are unlikely to contribute to the SVM training.

Random sampling of training data is another approach followed by approximate SVM
solvers. Lee and Mangasarian (2001) proposed reduced support vector machines (RSVM),
in which only a random subset of the training dataset is used. They solve a modified
formulation of the L2-SVM that minimizes the $\ell_2$-norm of $\xi$ instead of its $\ell_1$-norm. Bordes
et al. (2005) proposed the LASVM algorithm that uses active selection techniques to train
SVMs on a subset of the training dataset.

A core set (Clarkson, 2010) can be loosely defined as the subset of X for which the
solution of an optimization problem such as (3) is similar to that for the entire
dataset X. Tsang et al. (2005) proved that the L2-SVM is a reformulation of the minimum
enclosing ball problem for some kernels. They proposed the core vector machine (CVM) that
approximately solves the L2-SVM formulation using core sets. A simplified version of CVM
called the ball vector machine (BVM) was proposed in Tsang et al. (2007), where only an
enclosing ball is computed. Gärtner and Jaggi (2009) proposed an algorithm to solve the
L1-SVM problem, by computing the shortest distance between two polytopes (Bennett and
Bredensteiner, 2000) using core sets. However, there are no published results on solving
L1-SVM with non-linear kernels using their algorithm.

Another method used to approximately solve the SVM problem is to map the data
vectors into a randomized feature space that is relatively low dimensional compared to the
kernel space H (Rahimi and Recht, 2007). Inner products of the projections of the data
vectors are approximations of their kernel product. This effectively reduces the non-linear
SVM problem into the simpler linear SVM problem, enabling the use of fast linear SVM
solvers. This method is referred to as RfeatSVM in the following sections of this document.
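The idea of approximating a kernel product by an inner product of randomized features can be sketched as follows for the Gaussian kernel $\exp(-g\|x - z\|^2)$. This is a minimal illustration of random Fourier features, not the RfeatSVM implementation used in the experiments; the function name, the kernel parameter g, the number of features, and the seed are all assumptions made for this example.

```python
import math
import random


def make_random_features(dim, n_features, g, seed=0):
    """Return a feature map whose inner products approximate exp(-g ||x - z||^2)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * g)  # frequencies are drawn from N(0, 2g I)
    W = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_features)]
    B = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def features(x):
        return [scale * math.cos(sum(wk * xk for wk, xk in zip(w, x)) + b)
                for w, b in zip(W, B)]

    return features


g = 0.5
x, z = [0.3, -0.2], [-0.1, 0.4]
phi = make_random_features(dim=2, n_features=4000, g=g)
approx = sum(a * b for a, b in zip(phi(x), phi(z)))               # inner product of projections
exact = math.exp(-g * sum((a - b) ** 2 for a, b in zip(x, z)))    # true kernel product
```

With the projected vectors in hand, any linear SVM solver can be trained on them directly; the approximation error shrinks as the number of random features grows.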

2.2 Primal optimization 

In recent years, linear SVMs are finding increased use in applications with high-dimensional 
datasets. This has led to a surge in publications on efficient primal SVM solvers, which are 
mostly used for linear SVMs. To overcome the difficulties caused by the non-differentiability 
of the primal problem, the following methods are used. 



Stochastic sub-gradient descent (Zhang, 2004) uses the sub-gradient computed at some
data vector $x_i$ to iteratively update w. Shalev-Shwartz et al. (2011) proposed a stochastic
sub-gradient descent SVM solver, Pegasos, that is reported to be among the fastest linear
SVM solvers. Cutting plane algorithms (Kelley, 1960) solve the primal problem by successively
tightening a piecewise linear approximation. This approach was employed by Joachims (2006)
to solve linear SVMs with the implementation SVMperf. This work was generalized in
Joachims and Yu (2009) to include non-linear SVMs by approximately estimating w with
arbitrary basis vectors using the fix-point iteration method (Schölkopf and Smola, 2001).
Teo et al. (2010) proposed a related method for linear SVMs that corrected some stability
issues in the cutting plane methods.



3. Analysis of AESVM 

As mentioned in the introduction, AESVM is an optimization problem on a subset of the
training dataset called the representative set. In this section we first define the representative
set. Then we present some properties of AESVM. These results are intended to provide
theoretical justifications for the use of AESVM as an approximation to the SVM problem
(1). We denote the cardinality of a set S by |S|.

3.1 Definition of the representative set 



The convex hull of a set X is the smallest convex set containing X (Rockafellar, 1996) and
can be obtained by taking all possible convex combinations of elements of X. Assuming X
is finite, the convex hull is a convex polytope. The extreme points of X, EP(X), are defined to be
the vertices of the convex polytope formed by the convex hull of X. Any vector $x_i$ in X can
be represented as a convex combination of vectors in EP(X):

$$x_i = \sum_{x_t \in EP(X)} \pi_{i,t}\, x_t, \quad \text{where } 0 \le \pi_{i,t} \le 1 \text{ and } \sum_{x_t \in EP(X)} \pi_{i,t} = 1$$

We can see that functions of any data vector in X can be computed using only EP(X)
and the convex combination weights $\{\pi_{i,t}\}$. The design of AESVM is motivated by the
intuition that the use of extreme points may provide computational efficiency. However,
extreme points are not useful in all cases, as for some kernels all data vectors are extreme
points in kernel space. For example, for the Gaussian kernel, $K(x_i, x_i) = \phi(x_i)^T\phi(x_i) = 1$.
This implies that all the data vectors lie on the surface of the unit ball in the Gaussian kernel
space and therefore are extreme points. Hence, we introduce the concept of approximate
extreme points.
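The observation about the Gaussian kernel can be checked numerically. The sketch below assumes the parameterization $K(x, z) = \exp(-g\|x - z\|^2)$ with a hypothetical parameter g, and evaluates squared distances in kernel space using kernel products only.

```python
import math


def gaussian_kernel(x, z, g=0.5):
    """Gaussian kernel exp(-g ||x - z||^2); g is an illustrative parameter."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-g * sq)


def kernel_sq_distance(x, z, k):
    """||phi(x) - phi(z)||^2 expressed through kernel products only."""
    return k(x, x) + k(z, z) - 2.0 * k(x, z)


points = [[0.0, 0.0], [3.0, -1.0], [10.0, 7.5]]
# Squared norm of each vector in kernel space: K(x, x) = exp(0) = 1 for every x.
norms = [gaussian_kernel(p, p) for p in points]
```

Because every transformed vector has unit norm, none of them can be written as a convex combination of the others, which is why exact extreme points give no reduction for this kernel.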

Consider the set of transformed data vectors:

$$Z = \{z_i : z_i = \phi(x_i), \forall x_i \in X\} \tag{4}$$

Here, the explicit representation of vectors in kernel space is only for ease of understanding,
and all the computations are performed using kernel products. Let V be a positive
integer that is much smaller than N and $\epsilon$ be a small positive real number. For notational
simplicity, we assume N is divisible by V. Let $Z_l$ be subsets of Z for $l = 1, 2, ..., \frac{N}{V}$, such
that $Z = \bigcup_l Z_l$ and $Z_l \cap Z_m = \emptyset$ for $l \ne m$, where $m = 1, 2, ..., \frac{N}{V}$. We require that the
subsets $Z_l$ satisfy $|Z_l| = V$, $\forall l$, and

$$\forall z_i, z_j \in Z_l, \text{ we have } y_i = y_j \tag{5}$$

Let $Z_l^q$ denote an arbitrary subset of $Z_l$. Next, for any $z_i \in Z_l$ we define:

$$f(z_i, Z_l^q) = \min_{\mu_i} \Big\| z_i - \sum_{z_t \in Z_l^q} \mu_{i,t}\, z_t \Big\|^2 \tag{6}$$

s.t. $0 \le \mu_{i,t} \le 1$, and $\sum_{z_t \in Z_l^q} \mu_{i,t} = 1$

Consider the collection of subsets

$$\mathcal{Z}_l := \{Z_l^q : \max_{z_i \in Z_l} f(z_i, Z_l^q) \le \epsilon\}$$

A set of approximate extreme points of $Z_l$ is denoted by $Z_l^*$, and is defined as follows [1]:

$$Z_l^* \in \arg\min_{Z_l^q \in \mathcal{Z}_l} |Z_l^q| \tag{7}$$

It can be seen that $\mu_{i,t}$ for $z_t \in Z_l^*$ are analogous to the convex combination weights $\pi_{i,t}$ for
$x_t \in EP(X)$. The representative set $Z^*$ of Z is the union of the sets of approximate extreme
points of its subsets $Z_l$.

$$Z^* = \bigcup_{l=1}^{N/V} Z_l^*$$

1. The properties derived for AESVM in Section 3.2 are valid for any $Z_l^q \in \mathcal{Z}_l$. The requirement for the smallest
$Z_l^q$ is made only for the sake of a computationally simpler AESVM problem.

The representative set has properties that are similar to EP(X). Given any $z_i \in Z$, we
can find $Z_l$ such that $z_i \in Z_l$. Let $\gamma_{i,t} = \{\mu_{i,t}$ for $z_t \in Z_l^*$ and $z_i \in Z_l$, and 0 otherwise$\}$. Now
using (6), we can write:

$$z_i = \sum_{z_t \in Z^*} \gamma_{i,t}\, z_t + \tau_i \tag{8}$$

Here $\tau_i$ is a vector that accounts for the approximation error $f(z_i, Z_l^*)$ in (6). From (6)-(8)
we can conclude that:

$$\|\tau_i\|^2 \le \epsilon \quad \forall z_i \in Z \tag{9}$$

Since $\epsilon$ will be set to a very small positive constant, we can infer that $\tau_i$ is a very small
vector. The weights $\gamma_{i,t}$ are used to define $\beta_t$ in (2) as:

$$\beta_t = \sum_{i=1}^{N} \gamma_{i,t} \tag{10}$$

For ease of notation, we refer to the set $X^* := \{x_t : z_t \in Z^*\}$ as the representative
set of X in the remainder of this paper. For the sake of simplicity, we assume that all
$\gamma_{i,t}$, $\beta_t$, X, and $X^*$ are arranged so that $X^*$ is positioned as the first M vectors of X, where
$M = |Z^*|$.
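The quantity in (6) is a small quadratic program over the simplex that needs only kernel products. The sketch below solves it with a Frank-Wolfe iteration with exact line search, which is a technique chosen here for brevity and is not claimed to be the routine used in this paper. The linear kernel, the toy subset, and all function names are assumptions for illustration. The last lines form the weights of (10) as column sums of the convex combination weights.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def approx_error(x_i, subset, kernel, iters=1000, tol=1e-12):
    """Return (f, mu) with f = min over the simplex of ||phi(x_i) - sum_t mu_t phi(x_t)||^2."""
    n = len(subset)
    K = [[kernel(a, b) for b in subset] for a in subset]   # z_t^T z_s
    k_i = [kernel(x_i, a) for a in subset]                 # z_i^T z_t
    mu = [1.0 / n] * n                                     # start at the simplex centre
    for _ in range(iters):
        Kmu = [sum(K[t][s] * mu[s] for s in range(n)) for t in range(n)]
        grad = [2.0 * (Kmu[t] - k_i[t]) for t in range(n)]
        s = min(range(n), key=lambda t: grad[t])           # best simplex vertex
        d = [(1.0 if t == s else 0.0) - mu[t] for t in range(n)]
        gap = -sum(gt * dt for gt, dt in zip(grad, d))     # Frank-Wolfe duality gap
        if gap <= tol:
            break
        curv = sum(d[t] * K[t][u] * d[u] for t in range(n) for u in range(n))
        step = 1.0 if curv <= 0.0 else min(1.0, gap / (2.0 * curv))
        mu = [m + step * dt for m, dt in zip(mu, d)]
    Kmu = [sum(K[t][s] * mu[s] for s in range(n)) for t in range(n)]
    f = (kernel(x_i, x_i) - 2.0 * sum(m * k for m, k in zip(mu, k_i))
         + sum(m * km for m, km in zip(mu, Kmu)))
    return max(f, 0.0), mu


reps = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]          # candidate subset (a triangle)
f_inside, _ = approx_error([0.5, 0.5], reps, linear_kernel)    # point inside the hull
f_outside, _ = approx_error([-1.0, -1.0], reps, linear_kernel)  # point outside the hull

# Weights gamma for a 4-point subset represented by the triangle, and beta as in (10).
subset_points = reps + [[0.5, 0.5]]
gamma = [approx_error(p, reps, linear_kernel)[1] for p in subset_points]
beta = [sum(row[t] for row in gamma) for t in range(len(reps))]
```

A point inside the hull of the candidate subset has zero error, so it can be dropped from the representative set without violating (9); its weight is redistributed to the retained vectors through $\beta_t$.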

3.2 Properties of AESVM 

Consider the following optimization problem.

$$\min_{w,b} F_3(w, b) = \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} l(w, b, u_i) \tag{11}$$

where $u_i = \sum_{t=1}^{M}\gamma_{i,t}\, z_t$, $z_t \in Z^*$, $w \in H$, and $b \in \mathbb{R}$

We use the problem in (11) as an intermediary between (1) and (2). The intermediate
problem (11) has a direct relation to the AESVM problem, as given in the following theorem.
The properties of the max function given below are relevant to the following discussion:

$$\max(0, A + B) \le \max(0, A) + \max(0, B) \tag{12}$$

$$\max(0, A - B) \ge \max(0, A) - \max(0, B) \tag{13}$$

$$\sum_{i=1}^{N}\max(0, c_i A) = \max(0, A)\sum_{i=1}^{N} c_i \tag{14}$$

for $A, B, c_i \in \mathbb{R}$ and $c_i \ge 0$.

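Theorem 1 Let $F_3(w, b)$ and $F_2(w, b)$ be as defined in (11) and (2) respectively. Then,

$$F_3(w, b) \le F_2(w, b), \quad \forall w \in H \text{ and } b \in \mathbb{R}$$

Proof Let $\mathcal{L}_2(w, b, X^*) = \frac{C}{N}\sum_{t=1}^{M} l(w, b, z_t)\sum_{i=1}^{N}\gamma_{i,t}$ and $\mathcal{L}_3(w, b, X^*) = \frac{C}{N}\sum_{i=1}^{N} l(w, b, u_i)$, where
$u_i = \sum_{t=1}^{M}\gamma_{i,t}\, z_t$. From the properties of $\gamma_{i,t}$ in (6), and from (5) we get:

$$\mathcal{L}_3(w, b, X^*) = \frac{C}{N}\sum_{i=1}^{N}\max\Big\{0, \sum_{t=1}^{M}\gamma_{i,t}\{1 - y_i(w^T z_t + b)\}\Big\} = \frac{C}{N}\sum_{i=1}^{N}\max\Big\{0, \sum_{t=1}^{M}\gamma_{i,t}\{1 - y_t(w^T z_t + b)\}\Big\}$$

Using properties (12) and (14) we get:

$$\mathcal{L}_3(w, b, X^*) \le \frac{C}{N}\sum_{i=1}^{N}\sum_{t=1}^{M}\max\big[0, \gamma_{i,t}\{1 - y_t(w^T z_t + b)\}\big] = \frac{C}{N}\sum_{t=1}^{M}\max\big[0, 1 - y_t(w^T z_t + b)\big]\sum_{i=1}^{N}\gamma_{i,t} = \mathcal{L}_2(w, b, X^*)$$

Adding $\frac{1}{2}\|w\|^2$ to both sides of the inequality above we get

$$F_3(w, b) \le F_2(w, b) \tag{15}$$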
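Theorem 1 can be sanity-checked numerically on a toy instance. The sketch below assumes a linear kernel, a hand-picked set of three representative vectors, and hand-picked convex weights $\gamma_{i,t}$ that respect the label condition (5); all of these values are hypothetical and serve only to illustrate the inequality $F_3 \le F_2$.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hinge(w, b, z, y):
    return max(0.0, 1.0 - y * (dot(w, z) + b))


# Toy representative set (M = 3) and weights gamma for N = 4 data vectors.
Z_rep = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
y_rep = [1, 1, -1]
gamma = [[1.0, 0.0, 0.0], [0.3, 0.7, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [1, 1, 1, -1]          # gamma[i][t] > 0 only where y[i] == y_rep[t], as required by (5)
C, N, M = 2.0, 4, 3

beta = [sum(gamma[i][t] for i in range(N)) for t in range(M)]           # equation (10)
U = [[sum(gamma[i][t] * Z_rep[t][k] for t in range(M)) for k in range(2)]
     for i in range(N)]                                                  # u_i in (11)


def F2(w, b):
    return 0.5 * dot(w, w) + (C / N) * sum(
        beta[t] * hinge(w, b, Z_rep[t], y_rep[t]) for t in range(M))


def F3(w, b):
    return 0.5 * dot(w, w) + (C / N) * sum(hinge(w, b, U[i], y[i]) for i in range(N))


rng = random.Random(1)
trials = [([rng.uniform(-3, 3), rng.uniform(-3, 3)], rng.uniform(-3, 3))
          for _ in range(1000)]
holds = all(F3(w, b) <= F2(w, b) + 1e-12 for w, b in trials)
```

The inequality holds for every sampled (w, b) because the hinge loss is convex in the data vector, which is exactly what properties (12) and (14) express.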



The following theorem gives a relationship between the SVM problem and the intermediate
problem.

Theorem 2 Let $F_1(w, b)$ and $F_3(w, b)$ be as defined in (1) and (11) respectively. Then,

$$-\frac{C}{N}\sum_{i=1}^{N}\max\{0, y_i w^T\tau_i\} \le F_1(w, b) - F_3(w, b) \le \frac{C}{N}\sum_{i=1}^{N}\max\{0, -y_i w^T\tau_i\}$$

$\forall w \in H$ and $b \in \mathbb{R}$, where $\tau_i \in H$ is the vector defined in (8).

Proof Let $\mathcal{L}_1(w, b, X) = \frac{C}{N}\sum_{i=1}^{N} l(w, b, z_i)$ denote the average hinge loss that is minimized
in (1) and $\mathcal{L}_3(w, b, X^*)$ be as defined in Theorem 1. Using (8) and (1) we get:

$$\mathcal{L}_1(w, b, X) = \frac{C}{N}\sum_{i=1}^{N}\max\{0, 1 - y_i(w^T z_i + b)\} = \frac{C}{N}\sum_{i=1}^{N}\max\Big\{0, 1 - y_i\Big(w^T\Big(\sum_{t=1}^{M}\gamma_{i,t}\, z_t + \tau_i\Big) + b\Big)\Big\}$$

From the properties of $\gamma_{i,t}$ in (6), and from (5) we get:

$$\mathcal{L}_1(w, b, X) = \frac{C}{N}\sum_{i=1}^{N}\max\Big\{0, \sum_{t=1}^{M}\gamma_{i,t}\{1 - y_t(w^T z_t + b)\} - y_i w^T\tau_i\Big\} \tag{16}$$

Using (12) on (16), we get:

$$\mathcal{L}_1(w, b, X) \le \frac{C}{N}\sum_{i=1}^{N}\max\Big\{0, \sum_{t=1}^{M}\gamma_{i,t}\{1 - y_t(w^T z_t + b)\}\Big\} + \frac{C}{N}\sum_{i=1}^{N}\max\{0, -y_i w^T\tau_i\} = \mathcal{L}_3(w, b, X^*) + \frac{C}{N}\sum_{i=1}^{N}\max\{0, -y_i w^T\tau_i\}$$

Using (13) on (16), we get:

$$\mathcal{L}_1(w, b, X) \ge \frac{C}{N}\sum_{i=1}^{N}\max\Big\{0, \sum_{t=1}^{M}\gamma_{i,t}\{1 - y_t(w^T z_t + b)\}\Big\} - \frac{C}{N}\sum_{i=1}^{N}\max\{0, y_i w^T\tau_i\} = \mathcal{L}_3(w, b, X^*) - \frac{C}{N}\sum_{i=1}^{N}\max\{0, y_i w^T\tau_i\}$$

From the two inequalities above we get,

$$\mathcal{L}_3(w, b, X^*) - \frac{C}{N}\sum_{i=1}^{N}\max\{0, y_i w^T\tau_i\} \le \mathcal{L}_1(w, b, X) \le \mathcal{L}_3(w, b, X^*) + \frac{C}{N}\sum_{i=1}^{N}\max\{0, -y_i w^T\tau_i\}$$

Adding $\frac{1}{2}\|w\|^2$ to the inequality above we get

$$F_3(w, b) - \frac{C}{N}\sum_{i=1}^{N}\max\{0, y_i w^T\tau_i\} \le F_1(w, b) \le F_3(w, b) + \frac{C}{N}\sum_{i=1}^{N}\max\{0, -y_i w^T\tau_i\}$$



Using the above theorems we derive the following corollaries. These results provide the
theoretical justification for AESVM.

Corollary 3 Let $(w_1^*, b_1^*)$ be the solution of (1) and $(w_2^*, b_2^*)$ be the solution of (2). Then,

$$F_1(w_1^*, b_1^*) - F_2(w_2^*, b_2^*) \le C\sqrt{C\epsilon}$$

Proof It is known that $\|w_1^*\| \le \sqrt{C}$ (refer to Theorem 1 in Shalev-Shwartz et al. (2011)). It
is straightforward to see that the same result also applies to AESVM, $\|w_2^*\| \le \sqrt{C}$. Based
on (9) we know that $\|\tau_i\| \le \sqrt{\epsilon}$. From Theorem 2 we get:

$$F_1(w_2^*, b_2^*) - F_3(w_2^*, b_2^*) \le \frac{C}{N}\sum_{i=1}^{N}\max\{0, -y_i w_2^{*T}\tau_i\} \le \frac{C}{N}\sum_{i=1}^{N}\|w_2^*\|\,\|\tau_i\| \le \frac{C}{N}\sum_{i=1}^{N}\sqrt{C\epsilon} = C\sqrt{C\epsilon}$$

Since $(w_1^*, b_1^*)$ is the solution of (1), $F_1(w_1^*, b_1^*) \le F_1(w_2^*, b_2^*)$. Using this property and
Theorem 1 in the inequality above, we get:

$$F_1(w_1^*, b_1^*) - F_2(w_2^*, b_2^*) \le F_1(w_1^*, b_1^*) - F_3(w_2^*, b_2^*) \le F_1(w_2^*, b_2^*) - F_3(w_2^*, b_2^*) \le C\sqrt{C\epsilon} \tag{17}$$



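Now we demonstrate some properties of AESVM using the dual problem formulations
of AESVM and the intermediate problem. The dual form of AESVM is given by:

$$\max_{\hat{\alpha}} L_2(\hat{\alpha}) = \sum_{t=1}^{M}\hat{\alpha}_t - \frac{1}{2}\sum_{t=1}^{M}\sum_{s=1}^{M}\hat{\alpha}_t\hat{\alpha}_s y_t y_s z_t^T z_s \tag{18}$$

subject to $0 \le \hat{\alpha}_t \le \frac{C}{N}\sum_{i=1}^{N}\gamma_{i,t}$ and $\sum_{t=1}^{M}\hat{\alpha}_t y_t = 0$

The dual form of the intermediate problem is given by:

$$\max_{\breve{\alpha}} L_3(\breve{\alpha}) = \sum_{i=1}^{N}\breve{\alpha}_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\breve{\alpha}_i\breve{\alpha}_j y_i y_j u_i^T u_j \tag{19}$$

subject to $0 \le \breve{\alpha}_i \le \frac{C}{N}$ and $\sum_{i=1}^{N}\breve{\alpha}_i y_i = 0$

Consider the mapping function $h : \mathbb{R}^N \to \mathbb{R}^M$, defined as

$$h(\breve{\alpha}) = \Big\{\tilde{\alpha}_t : \tilde{\alpha}_t = \sum_{i=1}^{N}\gamma_{i,t}\,\breve{\alpha}_i\Big\} \tag{20}$$

It can be seen that the objective functions $L_2(h(\breve{\alpha}))$ and $L_3(\breve{\alpha})$ are identical.

$$L_2(h(\breve{\alpha})) = \sum_{t=1}^{M}\tilde{\alpha}_t - \frac{1}{2}\sum_{t=1}^{M}\sum_{s=1}^{M}\tilde{\alpha}_t\tilde{\alpha}_s y_t y_s z_t^T z_s = \sum_{i=1}^{N}\breve{\alpha}_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\breve{\alpha}_i\breve{\alpha}_j y_i y_j u_i^T u_j = L_3(\breve{\alpha})$$

It is also straightforward to see that, for any feasible $\breve{\alpha}$ of (19), $h(\breve{\alpha})$ is a feasible point of
(18) as it satisfies the constraints in (18). However, the converse is not always true. With
that clarification, we present the following corollary.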
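The identity between the two dual objectives under the mapping h, and the feasibility of the mapped point, can be verified on a small example. The sketch below assumes a linear kernel and uses hand-picked representative vectors, weights $\gamma_{i,t}$, and a feasible dual vector; all numerical values and names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


Z_rep = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]      # representative vectors z_t
y_rep = [1, 1, -1]
gamma = [[1.0, 0.0, 0.0], [0.3, 0.7, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [1, 1, 1, -1]
C, N, M = 2.0, 4, 3
U = [[sum(gamma[i][t] * Z_rep[t][k] for t in range(M)) for k in range(2)]
     for i in range(N)]                              # u_i = sum_t gamma_it z_t

a = [0.1, 0.1, 0.2, 0.4]     # feasible for (19): 0 <= a_i <= C/N and sum_i a_i y_i = 0

# Mapping (20): h(a)_t = sum_i gamma_it a_i
h = [sum(gamma[i][t] * a[i] for i in range(N)) for t in range(M)]

L3_value = sum(a) - 0.5 * sum(a[i] * a[j] * y[i] * y[j] * dot(U[i], U[j])
                              for i in range(N) for j in range(N))
L2_value = sum(h) - 0.5 * sum(h[t] * h[s] * y_rep[t] * y_rep[s] * dot(Z_rep[t], Z_rep[s])
                              for t in range(M) for s in range(M))

beta = [sum(gamma[i][t] for i in range(N)) for t in range(M)]
box_ok = all(0.0 <= h[t] <= (C / N) * beta[t] + 1e-12 for t in range(M))
balance = sum(h[t] * y_rep[t] for t in range(M))     # should vanish, as in (18)
```

The reverse direction can fail because a feasible point of (18) need not be expressible as $h(\breve{\alpha})$ for some $\breve{\alpha}$ inside the box of (19), which is why Corollary 4 states this as an explicit assumption.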

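Corollary 4 Let $(w_1^*, b_1^*)$ be the solution of (1) and $(w_2^*, b_2^*)$ be the solution of (2). Let $\hat{\alpha}_2$ be
the dual variable corresponding to $(w_2^*, b_2^*)$. Let $h(\breve{\alpha}_2)$ be as defined in (20). If there exists
an $\breve{\alpha}_2$ such that $h(\breve{\alpha}_2) = \hat{\alpha}_2$ and $\breve{\alpha}_2$ is a feasible point of (19), then,

$$F_1(w_2^*, b_2^*) - F_1(w_1^*, b_1^*) \le 2C\sqrt{C\epsilon}$$

Proof Let $(w_3^*, b_3^*)$ be the solution of (11) and $\breve{\alpha}_3$ the solution of (19). We know that
$L_3(\breve{\alpha}_2) = L_2(\hat{\alpha}_2) = F_2(w_2^*, b_2^*)$ and $L_3(\breve{\alpha}_3) = F_3(w_3^*, b_3^*)$. Since $L_3(\breve{\alpha}_3) \ge L_3(\breve{\alpha}_2)$, we get

$$F_3(w_3^*, b_3^*) \ge F_2(w_2^*, b_2^*)$$

But, from Theorem 1 we know $F_3(w_3^*, b_3^*) \le F_3(w_2^*, b_2^*) \le F_2(w_2^*, b_2^*)$. Hence

$$F_3(w_3^*, b_3^*) = F_3(w_2^*, b_2^*)$$

From the above result we get

$$F_3(w_2^*, b_2^*) - F_3(w_1^*, b_1^*) \le 0 \tag{21}$$

From Theorem 2 we have the following inequalities:

$$-\frac{C}{N}\sum_{i=1}^{N}\max\{0, y_i w_1^{*T}\tau_i\} \le F_1(w_1^*, b_1^*) - F_3(w_1^*, b_1^*) \tag{22}$$

$$F_1(w_2^*, b_2^*) - F_3(w_2^*, b_2^*) \le \frac{C}{N}\sum_{i=1}^{N}\max\{0, -y_i w_2^{*T}\tau_i\} \tag{23}$$

Adding (22) and (23) we get:

$$F_1(w_2^*, b_2^*) - F_1(w_1^*, b_1^*) \le R + \frac{C}{N}\sum_{i=1}^{N}\big[\max\{0, -y_i w_2^{*T}\tau_i\} + \max\{0, y_i w_1^{*T}\tau_i\}\big] \tag{24}$$

where $R = F_3(w_2^*, b_2^*) - F_3(w_1^*, b_1^*)$. Using (21) and the properties $\|w_2^*\| \le \sqrt{C}$ and $\|w_1^*\| \le \sqrt{C}$
in (24):

$$F_1(w_2^*, b_2^*) - F_1(w_1^*, b_1^*) \le \frac{C}{N}\sum_{i=1}^{N}\big[\max\{0, -y_i w_2^{*T}\tau_i\} + \max\{0, y_i w_1^{*T}\tau_i\}\big] \le \frac{C}{N}\sum_{i=1}^{N}\big(\|w_2^*\|\,\|\tau_i\| + \|w_1^*\|\,\|\tau_i\|\big) \le \frac{C}{N}\sum_{i=1}^{N} 2\sqrt{C\epsilon} = 2C\sqrt{C\epsilon}$$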

Now we prove a relationship between AESVM and the Gram matrix approximation
methods mentioned in Section 2.1.

Corollary 5 Let $L_1(\alpha)$, $L_3(\breve{\alpha})$, and $F_2(w, b)$ be the objective functions of the SVM dual
(3), intermediate dual (19) and AESVM (2) respectively. Let $z_i$, $\tau_i$, and $u_i$ be as defined
in (4), (8), and (11) respectively. Let $G$ and $\tilde{G}$ be the N×N matrices with $G_{ij} = y_i y_j z_i^T z_j$
and $\tilde{G}_{ij} = y_i y_j u_i^T u_j$ respectively. Then for any feasible $\alpha$, $\breve{\alpha}$, w, and b:

1. Rank of $\tilde{G} = M$, $L_1(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\alpha G\alpha^T$, $L_3(\breve{\alpha}) = \sum_{i=1}^{N}\breve{\alpha}_i - \frac{1}{2}\breve{\alpha}\tilde{G}\breve{\alpha}^T$, and

$$\mathrm{Trace}(G - \tilde{G}) \le N\epsilon + 2\sum_{t=1}^{M} z_t^T\sum_{i=1}^{N}\gamma_{i,t}\,\tau_i$$

2. $F_2(w, b) \ge L_3(\breve{\alpha})$
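Proof Using $G$, the SVM dual objective function $L_1(\alpha)$ can be represented as:

$$L_1(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\alpha G\alpha^T$$

Similarly, $L_3(\breve{\alpha})$ can be represented using $\tilde{G}$ as:

$$L_3(\breve{\alpha}) = \sum_{i=1}^{N}\breve{\alpha}_i - \frac{1}{2}\breve{\alpha}\tilde{G}\breve{\alpha}^T$$

Applying $u_i = \sum_{t=1}^{M}\gamma_{i,t}\, z_t$, $\forall z_t \in Z^*$ to the definition of $\tilde{G}$, we get:

$$\tilde{G} = \Gamma A \Gamma^T$$

Here $A$ is the M×M matrix comprised of $A_{ts} = y_t y_s z_t^T z_s$, $\forall z_t, z_s \in Z^*$ and $\Gamma$ is the N×M
matrix with the elements $\Gamma_{it} = \gamma_{i,t}$. Hence the rank of $\tilde{G} = M$ and the intermediate dual
problem (19) is a low rank approximation of the SVM dual problem (3).

The Gram matrix approximation error can be quantified using (8) and (9) as:

$$\mathrm{Trace}(G - \tilde{G}) = \sum_{i=1}^{N}\Big[z_i^T z_i - \Big(\sum_{t=1}^{M}\gamma_{i,t}\, z_t\Big)^T\sum_{s=1}^{M}\gamma_{i,s}\, z_s\Big] = \sum_{i=1}^{N}\Big[\tau_i^T\tau_i + 2\sum_{t=1}^{M}\gamma_{i,t}\, z_t^T\tau_i\Big] \le N\epsilon + 2\sum_{t=1}^{M} z_t^T\sum_{i=1}^{N}\gamma_{i,t}\,\tau_i$$

By the principle of duality, we know that $F_3(w, b) \ge L_3(\breve{\alpha})$, $\forall w \in H$ and $b \in \mathbb{R}$, where
$\breve{\alpha}$ is any feasible point of (19). Using Theorem 1 on the inequality above, we get

$$F_2(w, b) \ge L_3(\breve{\alpha}), \quad \forall w \in H, b \in \mathbb{R} \text{ and feasible } \breve{\alpha}$$

Thus the AESVM problem minimizes an upper bound ($F_2(w, b)$) of a rank M Gram matrix
approximation of $L_1(\alpha)$. ■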

Based on the theoretical results in this section, it is reasonable to suggest that for small
values of $\epsilon$, the solution of AESVM is close to the solution of SVM.

4. Computation of the representative set

In this section, we present algorithms to compute the representative set. The AESVM
formulation can be solved with any standard SVM solver such as SMO, and hence we do
not discuss methods to solve it. As described in Section 3.1, we require an algorithm to
compute approximate extreme points in kernel space. Osuna and Castro (2002) proposed
an algorithm to derive extreme points of the convex hull of a dataset in kernel space.
Their algorithm is computationally intensive, with a time complexity of $O(N\,S(N))$, and
is unsuitable for large datasets as $S(N)$ typically has a super-linear dependence on N. The
function $S(N)$ denotes the time complexity of an SVM solver (required by their algorithm)
to train on a dataset of size N. We next propose two algorithms leveraging the work by
Osuna and Castro (2002) to compute the representative set in kernel space $Z^*$ with much
smaller time complexities.

We followed the divide and conquer approach to develop our algorithms. The dataset is
first divided into subsets $X_q$, $q = 1, 2, ..., Q$, where $|X_q| \le P$, $Q \ge \frac{N}{P}$ and $X = \{X_1, X_2, ..., X_Q\}$.
The parameter P is a predefined large integer. It is desired that each subset $X_q$ contains
data vectors that are more similar to each other than data vectors in other subsets. Our
notion of similarity of data vectors in a subset is that the distances between data vectors
within a subset are less than the distances between data vectors in distinct subsets. This
first level of segregation is followed by another level of segregation. We can regard the first
level of segregation as coarse segregation and the second as fine segregation. Finally, the
approximate extreme points of the subsets obtained after segregation are computed. The
two different algorithms to compute the representative set differ only in the first level of
segregation as described in the following sections.

4.1 First level of segregation 

We propose the methods FLS1 and FLS2, given below, to perform a first level of segregation.
In the following description we use arrays $\Delta'$ and $\Delta'_2$ of N elements. Each element of $\Delta'$
($\Delta'_2$), $\delta_i$ ($\delta_i^2$), contains the index in X of the last data vector of the subset to which $x_i$
belongs. It is straightforward to replace this N element array with a smaller array of size
equal to the number of subsets. We use an N element array for ease of description.

1. FLS1(X', P)

For some applications, such as anomaly detection on sequential data, data vectors are
found to be homogeneous within intervals. For example, atmospheric conditions typically
do not change within a few minutes and hence weather data is homogeneous for a
short span. For such datasets it is enough to segregate the data vectors based on their position
in the training dataset. The same method can also be used on very large datasets without
any homogeneity, in order to reduce computation time. The complexity of this method is
$O(N')$, where $N' = |X'|$.

[X', $\Delta'$] = FLS1(X', P)

1. For outerIndex = 1 to ceiling($\frac{|X'|}{P}$)

2. For innerIndex = (outerIndex - 1)P + 1 to min((outerIndex)P, |X'|)

3. Set $\delta_{innerIndex}$ = min((outerIndex)P, |X'|)

2. FLS2(X', P)

When the dataset is not homogeneous within intervals or it is not excessively large we
use the more sophisticated algorithm, FLS2, of time complexity $O(N' \log_2 \frac{N'}{P})$ given below.
In step 1 of FLS2, the distance $d_i$ in kernel space of all $x_i \in X'$ from $x_j$ is computed as
$d_i = \|\phi(x_i) - \phi(x_j)\|^2 = k(x_i, x_i) + k(x_j, x_j) - 2k(x_i, x_j)$. The algorithm FLS2(X', P), in
effect builds a binary search tree, with each node containing the data vector $x_k$ selected in
step 2 that partitions a subset of the dataset into two. The sizes of the subsets successively
halve, on downward traversal from the root of the tree to the other nodes. When the sizes of
all the subsets at a level become $\le P$ the algorithm halts. The complexity of FLS2 can be
derived easily when the algorithm is considered as an incomplete binary search tree building
method. The last level of such a tree will have $O(\frac{N'}{P})$ nodes and consequently the height
of the tree is $O(\log_2 \frac{N'}{P})$. At each level of the tree the calls to the BFPRT algorithm (Blum
et al., 1973) and the rearrangement of the data vectors in steps 2 and 3 are of $O(N')$ time
complexity. Hence the overall time complexity of FLS2(X', P) is $O(N' \log_2 \frac{N'}{P})$.



[X', $\Delta'$] = FLS2(X', P)

1. Compute distance $d_i$ in kernel space of all $x_i \in X'$ from the first vector $x_j$ in X'

2. Select $x_k$ such that there exist $\frac{|X'|}{2}$ data vectors $x_i \in X'$ with $d_i < d_k$, using the
linear time BFPRT algorithm

3. Using $x_k$, rearrange X' as $X' = \{X^1, X^2\}$, where $X^1 = \{x_i : d_i < d_k, x_i \in X'\}$ and
$X^2 = \{x_i : x_i \in X' \text{ and } x_i \notin X^1\}$

4. If $\frac{|X'|}{2} \le P$

For i where $x_i \in X^1$, set $\delta_i$ = index of last data vector in $X^1$.
For i where $x_i \in X^2$, set $\delta_i$ = index of last data vector in $X^2$.

5. If $\frac{|X'|}{2} > P$

Run FLS2($X^1$, P) and FLS2($X^2$, P)
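The two first-level segregation methods can be sketched in a few lines. The sketch below is illustrative only and departs from the pseudocode in three labeled ways: it returns lists of indices for each subset instead of filling the array $\Delta'$, it uses Python's built-in `sorted` (which is $O(n \log n)$) in place of the linear time BFPRT selection, and it adds a guard for subsets with fewer than two vectors. The function names and the choice of a linear kernel in the example are assumptions.

```python
def fls1(n, P):
    """FLS1 sketch: consecutive blocks of at most P indices out of 0..n-1."""
    return [list(range(start, min(start + P, n))) for start in range(0, n, P)]


def fls2(X, idx, P, kernel):
    """FLS2 sketch: recursively halve idx by kernel distance from its first vector."""
    if len(idx) < 2:                      # guard added for this sketch
        return [idx]
    ref = X[idx[0]]
    # Step 1: squared kernel-space distance from the first vector of the subset.
    dist = {i: kernel(X[i], X[i]) + kernel(ref, ref) - 2.0 * kernel(X[i], ref)
            for i in idx}
    # Steps 2-3: split at the median distance (sorting stands in for BFPRT here).
    order = sorted(idx, key=lambda i: dist[i])
    half = len(order) // 2
    near, far = order[:half], order[half:]
    # Steps 4-5: stop once the halves are small enough, otherwise recurse.
    if len(idx) / 2 <= P:
        return [near, far]
    return fls2(X, near, P, kernel) + fls2(X, far, P, kernel)


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
blocks = fls1(len(X), 3)
subsets = fls2(X, list(range(len(X))), 2, linear_kernel)
```

On this toy input FLS1 simply cuts the sequence by position, while FLS2 groups vectors that are close to each other in kernel space, which is the behavior desired when the data are not homogeneous within intervals.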



4.2 Second level of segregation 

After the initial segregation, another method SLS(X', V, $\Delta'$) is used to further segregate each
set $X_q$ into smaller subsets $X_{q_r}$ of maximum size V, $X_q = \{X_{q_1}, X_{q_2}, ..., X_{q_R}\}$, where V is
predefined ($V < P$) and $R = \text{ceiling}(\frac{|X_q|}{V})$. The algorithm SLS(X', V, $\Delta'$) is given below.
In step 2.b, $x_t$ is the data vector in $X_q$ that is farthest from the origin in the space of the
data vectors. For some kernels, such as the Gaussian kernel, all data vectors are equidistant
from the origin in kernel space. If the algorithm chose $a_l$ in step 2.b based on distances in
such kernel spaces, the choice would be arbitrary, and such a situation is avoided here. Each
iteration of the For loop in step 2 involves several runs of the BFPRT algorithm, with each
run followed by a rearrangement of $X_q$. Specifically, the BFPRT algorithm is first run on P
data vectors, then on $P - V$ data vectors, then on $P - 2V$ data vectors and so on. The time
complexity of each iteration of the For loop including the BFPRT algorithm run and the
rearrangement of data vectors is: $O(P + (P - V) + (P - 2V) + ... + V) \Rightarrow O(\frac{P^2}{V})$. The overall
complexity of SLS(X', V, $\Delta'$) considering the Q For loop iterations is $O(\frac{N'}{P}\frac{P^2}{V}) \Rightarrow O(\frac{N'P}{V})$,
since $Q = O(\frac{N'}{P})$.



[$X'$, $\Delta'_2$] = SLS($X'$, $V$, $\Delta'$)

1. Initialize $l = 1$

2. For $q = 1$ to $Q$

   (a) Identify subset $X_q$ of $X'$ using $\Delta'$

   (b) Set $a_l = \phi(x_t)$, where $x_t \in \arg\max_i \|x_i\|^2$, $x_i \in X_q$

   (c) Compute distance $d_i$ in kernel space of all $x_i \in X_q$ from $a_l$

   (d) Select $x_k$ such that there exist $V$ data vectors $x_i \in X_q$ with $d_i < d_k$, using the BFPRT algorithm

   (e) Using $x_k$, rearrange $X_q$ as $X_q = \{X^1, X^2\}$, where $X^1 = \{x_i : d_i < d_k, x_i \in X_q\}$ and $X^2 = \{x_i : x_i \in X_q \text{ and } x_i \notin X^1\}$

   (f) For i where $x_i \in X^1$, set $\delta_i^2$ = index of last data vector in $X^1$, where $\delta_i^2$ is the $i$-th element of $\Delta'_2$

   (g) Remove $X^1$ from $X_q$

   (h) If $|X^2| > V$

       Set: $l = l + 1$ and $a_l = \phi(x_k)$
       Repeat steps 2.c to 2.h

   (i) If $|X^2| \le V$

       For i where $x_i \in X^2$, set $\delta_i^2$ = index of last data vector in $X^2$
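The peeling procedure of SLS can be sketched as below. As with any such illustration, several assumptions are made: the Gaussian kernel form exp(-g ||a - b||^2) is assumed, sorting stands in for BFPRT selection, a subset that already holds at most V vectors is left unsplit, and the names sls and gaussian_kernel are hypothetical.

```python
import math

def gaussian_kernel(a, b, g):
    """Assumed kernel form: K(a, b) = exp(-g * ||a - b||^2)."""
    return math.exp(-g * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def sls(X, V, delta, g=1.0):
    """Split every first-level subset of X into pieces of at most V vectors.

    delta[i] is the index of the last vector of the first-level subset that
    holds X[i]. Returns the rearranged list and delta2, the analogous index
    array for the second-level subsets.
    """
    X = list(X)
    delta2 = [0] * len(X)
    start = 0
    while start < len(X):
        end = delta[start] + 1             # step 2.a: X_q is X[start:end]
        block = X[start:end]
        # Step 2.b: anchor = vector farthest from the origin in input space.
        anchor = max(block, key=lambda x: sum(v * v for v in x))
        pos = start
        while len(block) > V:
            # Steps 2.c-2.e: order by kernel distance from the anchor
            # (sorting stands in for BFPRT selection plus rearrangement).
            block.sort(key=lambda x: 2.0 - 2.0 * gaussian_kernel(x, anchor, g))
            X[pos:pos + V] = block[:V]     # X^1: the V vectors closest to the anchor
            for i in range(pos, pos + V):  # step 2.f
                delta2[i] = pos + V - 1
            block = block[V:]              # step 2.g: remove X^1
            pos += V
            anchor = block[0]              # step 2.h: x_k, the closest remaining vector
        X[pos:end] = block                 # step 2.i: the remainder is the last piece
        for i in range(pos, end):
            delta2[i] = end - 1
        start = end
    return X, delta2
```

With two first level subsets of 25 vectors each and V = 10, the sketch produces pieces of sizes 10, 10 and 5 inside each first level subset, without moving any vector across a first level boundary.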



4.3 Computation of the approximate extreme points 

After computing the subsets $X_{q_r}$, the algorithm DeriveAE is applied to each $X_{q_r}$ to compute its approximate extreme points. The algorithm DeriveAE is described below. DeriveAE uses three routines. SphereSet($X_{q_r}$) returns all $x_i \in X_{q_r}$ that lie on the surface of the smallest hypersphere in kernel space that contains $X_{q_r}$. It computes the hypersphere as a hard margin support vector data descriptor (SVDD) (Tax and Duin, 2004). SphereSort($X_{q_r}$) returns data vectors $x_i \in X_{q_r}$ sorted in descending order of distance in the kernel space from the center of the SVDD hypersphere. CheckPoint($x_i$, $\Psi$) returns TRUE if $x_i$ is an approximate extreme point of the set $\Psi$ in kernel space. The operator $A \setminus B$ indicates a set operation that returns the set of the members of $A$ excluding $A \cap B$. The matrix $X^*_{q_r}$ contains the approximate extreme points of $X_{q_r}$ and $\beta_{q_r}$ is a $|X^*_{q_r}|$ sized vector.

[$X^*_{q_r}$, $\beta_{q_r}$] = DeriveAE($X_{q_r}$)

1. Initialize: $X^*_{q_r}$ = SphereSet($X_{q_r}$) and $\Psi = \emptyset$

2. Set $\zeta$ = SphereSort($X_{q_r} \setminus X^*_{q_r}$)

3. For each $x_i$ taken in order from $\zeta$, call the routine CheckPoint($x_i$, $X^*_{q_r} \cup \Psi$)

   If it returns FALSE, then set $\Psi = \Psi \cup x_i$

4. For each $x_i \in \Psi$, execute CheckPoint($x_i$, $X^*_{q_r} \cup \{\Psi \setminus x_i\}$)

   If it returns FALSE, set $X^*_{q_r} = X^*_{q_r} \cup x_i$

5. Initialize a matrix $\Gamma$ of size $|X_{q_r}| \times |X^*_{q_r}|$ with all elements set to 0

   Set $\mu_{k,k} = 1 \; \forall x_k \in X^*_{q_r}$, where $\mu_{i,j}$ is the element in the $i$-th row and $j$-th column of $\Gamma$

6. For each $x_i \in X_{q_r}$ and $x_i \notin X^*_{q_r}$, execute CheckPoint($x_i$, $X^*_{q_r}$)

   Set the $i$-th row of $\Gamma = \mu_i$, where $\mu_i$ is the result of CheckPoint($x_i$, $X^*_{q_r}$)

7. For $j = 1$ to $|X^*_{q_r}|$

   Set $\beta^j_{q_r} = \sum_{k=1}^{|X_{q_r}|} \mu_{k,j}$
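The final step of DeriveAE, which turns the weight matrix into the vector of per-point weights, amounts to a column sum. The toy example below illustrates this; the matrix values are invented for illustration only, and the function name beta_from_gamma is hypothetical.

```python
def beta_from_gamma(gamma):
    """Column sums of the weight matrix: beta[j] = sum over rows k of mu[k][j]."""
    n_cols = len(gamma[0])
    return [sum(row[j] for row in gamma) for j in range(n_cols)]

# Toy weight matrix for a subset of 4 vectors with 2 approximate extreme points.
# Rows 0 and 1 are the extreme points themselves (step 5); rows 2 and 3 hold
# made-up convex-combination weights of the kind CheckPoint returns (step 6).
gamma = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.25, 0.75],
    [0.5, 0.5],
]
beta = beta_from_gamma(gamma)
```

Because every row of the matrix sums to one, the entries of beta always add up to the number of data vectors in the subset, so beta can be read as the number of original vectors each approximate extreme point stands for.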



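CheckPoint($x_i$, $\Psi$) is computed by solving the following quadratic optimization problem:

$$\min_{\mu_i} \; p(x_i, \Psi) = \Big\| \phi(x_i) - \sum_{t=1}^{|\Psi|} \mu_{i,t} \phi(x_t) \Big\|^2$$

$$\text{s.t. } x_t \in \Psi, \quad 0 \le \mu_{i,t} \le 1 \quad \text{and} \quad \sum_{t=1}^{|\Psi|} \mu_{i,t} = 1$$

where $\| \phi(x_i) - \sum_{t=1}^{|\Psi|} \mu_{i,t} \phi(x_t) \|^2 = K(x_i, x_i) + \sum_{t=1}^{|\Psi|} \sum_{s=1}^{|\Psi|} \mu_{i,t} \mu_{i,s} K(x_t, x_s) - 2 \sum_{t=1}^{|\Psi|} \mu_{i,t} K(x_i, x_t)$. If the optimized value of $p(x_i, \Psi) \le \epsilon$, CheckPoint($x_i$, $\Psi$) returns TRUE and otherwise it returns FALSE. It can be seen that the formulation of $p(x_i, \Psi)$ is similar to (pi). The value of $\mu_i$ computed by CheckPoint($x_i$, $X^*_{q_r}$) is used in step 6 of DeriveAE.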
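The quadratic problem solved by CheckPoint can be illustrated with a small solver. The sketch below is not the SMO-based solver used in the paper: it swaps in Frank-Wolfe iterations with exact line search, assumes a Gaussian kernel of the form exp(-g ||a - b||^2), and uses hypothetical names (check_point, gaussian_kernel) and an illustrative default tolerance.

```python
import math

def gaussian_kernel(a, b, g=1.0):
    """Assumed kernel form: K(a, b) = exp(-g * ||a - b||^2)."""
    return math.exp(-g * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def check_point(x, psi, eps=1e-3, kernel=gaussian_kernel, iters=500):
    """Return (p <= eps, mu, p), where mu approximately minimises
    p = K(x,x) + mu^T K mu - 2 k^T mu over the probability simplex."""
    n = len(psi)
    K = [[kernel(a, b) for b in psi] for a in psi]
    k = [kernel(x, a) for a in psi]
    kxx = kernel(x, x)

    # Warm start at the member of psi closest to x in kernel space.
    j = min(range(n), key=lambda t: K[t][t] - 2.0 * k[t])
    mu = [0.0] * n
    mu[j] = 1.0
    for _ in range(iters):
        grad = [2.0 * (sum(K[t][s] * mu[s] for s in range(n)) - k[t])
                for t in range(n)]
        s = min(range(n), key=lambda t: grad[t])   # best simplex vertex
        d = [-m for m in mu]
        d[s] += 1.0                                # direction e_s - mu
        slope = sum(grad[t] * d[t] for t in range(n))
        if slope >= -1e-12:                        # Frank-Wolfe gap closed
            break
        curv = sum(d[a] * d[b] * K[a][b] for a in range(n) for b in range(n))
        # Exact minimiser of the quadratic along d, clipped to stay feasible.
        gamma = 1.0 if curv <= 0.0 else min(1.0, -slope / (2.0 * curv))
        mu = [m + gamma * di for m, di in zip(mu, d)]

    quad = sum(mu[a] * mu[b] * K[a][b] for a in range(n) for b in range(n))
    p = kxx + quad - 2.0 * sum(mu[t] * k[t] for t in range(n))
    return p <= eps, mu, p
```

A vector that coincides with a member of the set is reproduced exactly, so the optimized value is zero and the routine returns TRUE; a vector far from every member of the set has a kernel value near zero with all of them, so the optimized value stays well above the tolerance and the routine returns FALSE.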

Now we compute the time complexity of DeriveAE. We use the fact that the optimization problem in CheckPoint($x_i$, $\Psi$) is essentially the same as the dual optimization problem of SVM given in (3). Since DeriveAE solves several SVM training problems in steps 1, 3, 4, and 6, it is necessary to know the training time complexity of an SVM. As any SVM solver method can be used, we denote the training time complexity of each step of DeriveAE that solves an SVM problem as $O(S(A_{q_r}))$ (see footnote 2). Here $A_{q_r}$ is the largest value of $|X^*_{q_r} \cup \Psi|$ during the run of DeriveAE($X_{q_r}$). This enables us to derive a generic expression for the complexity of DeriveAE, independent of the SVM solver method used. Hence the time complexity of step 1 is $O(S(A_{q_r}))$. The time complexities of steps 3, 4 and 6 are $O(V S(A_{q_r}))$, $O(A_{q_r} S(A_{q_r}))$, and $O(A_{q_r} S(A_{q_r}))$ respectively. The time complexity of step 2 is $O(V |\Psi_1| + V \log_2 V)$, where $\Psi_1$ = SphereSet($X_{q_r}$). Hence the time complexity of DeriveAE is $O(V |\Psi_1| + V \log_2 V + V S(A_{q_r}) + A_{q_r} S(A_{q_r}))$. Since $|\Psi_1|$ is typically very small and $A_{q_r} < V$, we denote the time complexity of DeriveAE by $O(V \log_2 V + V S(A_{q_r}))$.

2. For SMO based implementations, such as the implementation we used for Section 5, S(A) = O(A^

4.4 Combining all the methods to compute X* 

To derive $X^*$, it is required to first rearrange $X$, so that data vectors from each class are grouped together as $X = \{X^+, X^-\}$. Here $X^+ = \{x_i : y_i = 1, x_i \in X\}$ and $X^- = \{x_i : y_i = -1, x_i \in X\}$. Then the selected segregation methods are run on $X^+$ and $X^-$ separately. The algorithm DeriveRS given below combines all the algorithms defined earlier in this section, with a few additional steps, to compute the representative set of $X$. The complexity of DeriveRS (see footnote 3) can easily be computed by summing the complexities of its steps. The complexity of steps 1 and 6 is $O(N)$. The complexity of step 2 is $O(N)$ if FLS1 is run or $O(N \log_2 \frac{N}{P})$ if FLS2 is run. In step 3, the $O(\frac{NP}{V})$ method SLS is run. In steps 4 and 5, DeriveAE is run on all the subsets $X_{q_r}$, giving a total complexity of $O(N \log_2 V + V \sum_{q=1}^{Q} \sum_{r=1}^{R} S(A_{q_r}))$. Here we use the fact that the number of subsets $X_{q_r}$ is $O(\frac{N}{V})$. Thus the complexity of DeriveRS is $O(N(\frac{P}{V} + \log_2 V) + V \sum_{q=1}^{Q} \sum_{r=1}^{R} S(A_{q_r}))$ when FLS1 is used and $O(N(\log_2 \frac{N}{P} + \frac{P}{V} + \log_2 V) + V \sum_{q=1}^{Q} \sum_{r=1}^{R} S(A_{q_r}))$ when FLS2 is used.

[$X^*$, $Y^*$, $\beta$] = DeriveRS($X$, $Y$, $P$, $V$)

1. Set $X^+ = \{x_i : x_i \in X, y_i = 1\}$ and $X^- = \{x_i : x_i \in X, y_i = -1\}$

2. Run [$X^+$, $\Delta^+$] = FLS($X^+$, $P$) and [$X^-$, $\Delta^-$] = FLS($X^-$, $P$), where FLS is FLS1 or FLS2

3. Run [$X^+$, $\Delta_2^+$] = SLS($X^+$, $V$, $\Delta^+$) and [$X^-$, $\Delta_2^-$] = SLS($X^-$, $V$, $\Delta^-$)

4. Using $\Delta_2^+$, identify each subset $X_{q_r}$ of $X^+$ and run [$X^*_{q_r}$, $\beta_{q_r}$] = DeriveAE($X_{q_r}$)

   Set $N^{+*}$ = sum of number of data vectors in all $X^*_{q_r}$ derived from $X^+$

5. Using $\Delta_2^-$, identify each subset $X_{q_r}$ of $X^-$ and run [$X^*_{q_r}$, $\beta_{q_r}$] = DeriveAE($X_{q_r}$)

   Set $N^{-*}$ = sum of number of data vectors in all $X^*_{q_r}$ derived from $X^-$

6. Combine in the same order, all $X^*_{q_r}$ to obtain $X^*$ and all $\beta_{q_r}$ to obtain $\beta$

   Set $Y^* = \{y_i : y_i = 1 \text{ for } i = 1, 2, \ldots, N^{+*}; \text{ and } y_i = -1 \text{ for } i = 1 + N^{+*}, 2 + N^{+*}, \ldots, N^{-*} + N^{+*}\}$



3. We present DeriveRS as one algorithm in spite of its two variants that use FLSl or FLS2, for simplicity 
and to conserve space. 




Nandan, Khargonekar, and Talathi 



5. Experiments 



We focused our experiments on an SMO (Fan et al., 2005) based implementation of AESVM and DeriveRS. We evaluated the classification performance of AESVM using the nine datasets described below. We first present an evaluation of the algorithm DeriveRS, followed by an evaluation of AESVM.



5.1 Datasets 

Nine datasets of varied size, dimensionality and density were used to evaluate DeriveRS and 
our AESVM implementation. For datasets D2, D3 and D4, we performed five fold cross 
validation. We did not perform five fold cross-validation on the other datasets, because 
they have been widely used in their native form with a separate training and testing set. 



D1: KDD'99 intrusion detection dataset (footnote 4): This dataset is available as a training set of 4898431 data vectors and a testing set of 311027 data vectors, with forty one features (D = 41). As described in Tavallaee et al. (2009), a huge portion of this dataset is comprised of repeated data vectors. Experiments were conducted only on the distinct data vectors. The number of distinct training set vectors was N = 1074974 and the number of distinct testing set vectors was N = 77216. The training set density = 33%.

D2: Localization data for person activity (footnote 5): This dataset has been used in a study on agent-based care for independent living (Kaluza et al., 2010). It has N = 164860 data vectors of seven features. It is comprised of continuous recordings from sensors attached to five people and can be used to predict the activity that was performed by each person at the time of data collection. In our experiments we used this dataset to validate a binary problem of classifying the activities 'lying' and 'lying down' from the other activities. Features 3 and 4, which give the time information, were not used in our experiments. Hence for this dataset D = 5. The dataset density = 96%.

D3: Seizure detection dataset: This dataset has N = 982863 data vectors, three features (D = 3) and density = 100%. It is comprised of continuous EEG recordings from rats induced with status epilepticus and is used to evaluate algorithms that classify seizure events from seizure-free EEG. An important characteristic of this dataset is that it is highly unbalanced: the total number of data vectors corresponding to seizures is minuscule compared to the remaining data. Details of the dataset can be found in Nandan et al. (2010), where it is used as dataset A.

D4: Forest cover type dataset (footnote 6): This dataset has N = 581012 data vectors, fifty four features (D = 54) and density = 22%. It is used to classify the forest cover of areas of 30 m x 30 m size into one of seven types. We followed the method used in Collobert et al. (2002), where a classification of forest cover type 2 from the other cover types was performed.



D5: IJCNN1 dataset (footnote 7): This dataset was used in the IJCNN 2001 generalization ability challenge (Chang and Lin, 2001a). The training set and testing set have 49990 (N = 49990) and 91701 data vectors respectively. It has 22 features (D = 22) and training set density =

D6: Adult income dataset (footnote 8): This dataset, derived from the 1994 Census database, was used to classify incomes over $50000 from those below it. The training set has N = 32561 with D = 123 and density = 11%, while the testing set has 16281 data vectors. The data is pre-processed as described in Platt (1999).

D7: Epsilon dataset (footnote 9): This is a dataset that was used for the 2008 Pascal large scale learning challenge and in Yuan et al. (2011). It is comprised of 400000 data vectors that are 100% dense with D = 2000. Since this is too large for our experiments, we used the first 10% of the training set, giving N = 40000. The testing set has 100000 data vectors.

D8: MNIST character recognition dataset (footnote 10): The widely used dataset (Lecun et al., 1998) of hand written characters has a training set of N = 60000, D = 780 and density = 19%. We performed the binary classification task of classifying the character '0' from the others. The testing set has 10000 data vectors.

D9: w8a dataset (footnote 11): This artificial dataset used in Platt (1999) was randomly generated and has D = 300 features. The training set has N = 49749 with a density = 4% and the testing set has 14951 data vectors.

4. http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data

5. http://archive.ics.uci.edu/ml/datasets/Localization+Data+for+Person+Activity

6. http://archive.ics.uci.edu/ml/datasets/Covertype



5.2 Evaluation of DeriveRS 

We began our experiments with an evaluation of the algorithm DeriveRS, described in Section 4. The performances of the two methods FLS1 and FLS2 were compared first. We ran DeriveRS on D1, D2, D4 and D5 with the parameters $P = 10^4$, $V = 10^3$, $\epsilon = 10^{-3}$, and $g = [2^{-4}, 2^{-3}, 2^{-2}, \ldots, 2^{2}]$, first with FLS1 and then FLS2. For D2, DeriveRS was run on the entire dataset for this particular experiment, instead of performing five fold cross-validation. This was done because D2 is a small dataset and the difference between the two first level segregation methods can be better observed when the dataset is as large as possible. The relatively small value of $P = 10^4$ was also chosen considering the small size of D2 and D5. To evaluate the effectiveness of FLS1 and FLS2, we also ran DeriveRS with FLS1 and FLS2 after randomly reordering each dataset. The results are shown in Figure 1.

For datasets D1 and D5, FLS2 gave smaller representative sets in a shorter duration than FLS1. As expected, for the relatively homogeneous dataset D2, FLS1 and FLS2 gave similar results, with FLS2 giving slightly larger representative sets. Dataset D4 was seen to have much smaller representative sets with FLS1 than with FLS2. The results of DeriveRS obtained after randomly rearranging the datasets indicate the utility of FLS2. For all the datasets, the results of FLS2 after random reordering were seen to be significantly better than the results of FLS1 after random rearrangement. Hence we can infer that the good results obtained with FLS2 are not caused by any pre-existing order in the datasets. After D2 and D4 were randomly rearranged, a sharp increase was observed in representative set sizes and computation times for DeriveRS with FLS1. This indicates the importance of dataset homogeneity to the performance of FLS1. The results reported for the randomized experiments on DeriveRS are the averages of five repetitions.

Figure 1: Performance of variants of DeriveRS with $g = [2^{-4}, 2^{-3}, 2^{-2}, \ldots, 2^{2}]$, for datasets D1, D2, D4, and D5. The results of DeriveRS with FLS1 and FLS2, after randomly reordering the datasets, are shown as Random+FLS1 and Random+FLS2, respectively.

7. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

8. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

9. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

10. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html

11. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

Next we investigated the impact of changes in the values of the parameters P and V on the performance of DeriveRS. All combinations of $P = \{10^4, 5 \times 10^4, 10^5, 2 \times 10^5\}$ and $V = \{10^2, 5 \times 10^2, 10^3, 2 \times 10^3, 3 \times 10^3\}$ were used to compute the representative set of D1. The computations were performed for $\epsilon = 10^{-3}$ and $g = 1$. The method FLS2 was used for the first level segregation in DeriveRS. The results are shown in Table 1. As expected for an algorithm of time complexity $O(N(\log_2 \frac{N}{P} + \frac{P}{V} + \log_2 V) + V \sum_{q=1}^{Q} \sum_{r=1}^{R} S(A_{q_r}))$, the computation time was generally observed to increase for an increase in the value of V or P. It should be

noted that our implementation of DeriveRS was based on SMO and hence S{Aq^) = Oi^Ai^). 

In some cases the computation time decreased when P or V increased. This is caused by a decrease in the value of $O(\sum_{q=1}^{Q} \sum_{r=1}^{R} S(A_{q_r}))$, which is inferred from the observed decrease of the size of the representative set M ($M \approx \sum_{q=1}^{Q} \sum_{r=1}^{R} A_{q_r}$). A sharp decrease in M was observed when V was increased. The impact of increasing P on the size of the representative set was found to be less drastic. This observation indicates that DeriveAE selects fewer approximate extreme points when V is larger.



(M/N) x 100% (Computation time in seconds)

P         V = 10^2     V = 5x10^2   V = 10^3     V = 2x10^3   V = 3x10^3
10^4      10.7(27)     6.1(67)      5.1(131)     4.5(258)     4.3(338)
5x10^4    9.9(78)      5.3(72)      4.4(130)     3.9(249)     3.7(351)
10^5      9.8(142)     5.2(83)      4.3(134)     3.7(242)     3.5(352)
2x10^5    9.8(254)     5.1(104)     4.2(144)     3.7(240)     3.4(355)

Table 1: The impact of varying P and V on the result of DeriveRS



As described in Section 5.3, we compared several SVM training algorithms with our implementation of AESVM. We performed a grid search with all combinations of the SVM hyper-parameters $C' = \{2^{-4}, 2^{-3}, \ldots, 2^{6}, 2^{7}\}$ and $g = \{2^{-4}, 2^{-3}, 2^{-2}, \ldots, 2^{1}, 2^{2}\}$. The hyper-parameter $C'$ is related to the hyper-parameter $C$ as $C' = \frac{C}{N}$. We represent the grid in terms of $C'$ as it is used in several SVM solvers such as LIBSVM, LASVM, CVM and BVM. Furthermore, the use of $C'$ enables the application of the same hyper-parameter grid to all datasets. To train AESVM with all the hyper-parameter combinations in the grid, the representative set has to be computed using DeriveRS for all values of the kernel hyper-parameter g in the grid. This is because the kernel space varies when the value of g is varied. For all the computations, the input parameters were set as $P = 10^5$ and $V = 10^3$. The first level segregation in DeriveRS was performed using FLS2. Three values of the tolerance parameter $\epsilon$ were investigated: $\epsilon = 10^{-3}$, $10^{-4}$ or $10^{-5}$.

The results of the computation for datasets D1-D5 are shown in Table 2. The percentage of data vectors in the representative set was found to increase with increasing values of g. This is intuitive, as when g increases the distance between the data vectors in kernel space increases. With increased distances, more data vectors $x_i$ become approximate extreme points. The increase in the number of approximate extreme points with g causes the rising trend of computation time shown in Table 2. For a decrease in the value of $\epsilon$, M increases. This is because, for smaller $\epsilon$, fewer $x_i$ would satisfy the condition: optimized $p(x_i, \Psi) \le \epsilon$ in CheckPoint($x_i$, $\Psi$). This results in the selection of a larger number of approximate extreme points in DeriveAE.
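The claim that kernel-space distances grow with g can be checked numerically. The sketch below assumes the Gaussian kernel takes the form exp(-g ||x - y||^2), for which the squared distance between two mapped vectors is 2 - 2K(x, y); the function name kernel_distance and the two sample points are illustrative only.

```python
import math

def kernel_distance(x, y, g):
    """Distance between phi(x) and phi(y) for K(x, y) = exp(-g * ||x - y||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(2.0 - 2.0 * math.exp(-g * sq))

x, y = (0.0, 0.0), (0.3, 0.4)             # input-space distance 0.5
grid = [2.0 ** e for e in range(-4, 3)]   # g = 2^-4, ..., 2^2
distances = [kernel_distance(x, y, g) for g in grid]
```

Under this kernel form the distance increases monotonically with g and approaches, but never reaches, the square root of 2, which is the distance between two orthogonal unit vectors in kernel space.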

The results of applying DeriveRS to the high-dimensional datasets D6-D9 are shown in Table 3. It was observed that M/N was much larger for D6-D9 than for the other datasets. We computed the representative set with $\epsilon = 10^{-3}$ only, as for smaller values of $\epsilon$ we expect the representative set to be close to 100% of the training set. The increasing trend of the size of the representative set with increasing g values can be observed in Table 3 also.

(M/N) x 100% (Computation time in seconds)

e        Dataset  g = 2^-4    g = 2^-3    g = 2^-2    g = 2^-1    g = 1       g = 2^1     g = 2^2
10^-3    D1       1.5(98)     1.9(104)    2.4(110)    3.2(119)    4.3(132)    5.9(148)    8.1(168)
         D2       1.2(7)      1.5(8)      2(9)        2.8(11)     4.1(15)     6(18)       9.2(23)
         D3       0.6(37)     0.6(37)     0.6(36)     0.6(36)     0.5(37)     0.6(37)     0.6(39)
         D4       4.3(45)     6.4(57)     9.4(74)     13.9(103)   20.7(139)   30.7(178)   44.8(216)
         D5       4.5(7)      8.3(9)      14(11)      21.8(14)    31.8(18)    43.7(21)    54.9(22)
10^-4    D1       3(136)      4(159)      5.3(191)    7.2(240)    9.9(297)    13.3(362)   17.4(435)
         D2       2.8(12)     3.8(18)     5(27)       6.8(37)     9.3(44)     13.5(44)    19.9(82)
         D3       0.5(36)     0.6(37)     0.6(38)     0.7(39)     0.8(41)     0.9(43)     1.1(47)
         D4       13.5(135)   18.3(211)   24.9(300)   34.2(400)   47.7(493)   63.5(513)   74.4(445)
         D5       20.1(16)    27.9(22)    37.4(27)    47.6(31)    57.3(34)    66(34)      74(34)
10^-5    D1       7(316)      9.3(425)    12.2(552)   15.7(726)   19.6(926)   24.2(1112)  28.9(1235)
         D2       6.2(59)     7.8(87)     9.8(98)     13(109)     18.3(138)   25.6(187)   34.3(235)
         D3       0.7(39)     0.8(42)     0.9(45)     1.1(50)     1.4(59)     1.7(73)     2.2(100)
         D4       30.7(607)   39.5(814)   51.9(1051)  66(1171)    75.1(1044)  77.8(839)   78.4(649)
         D5       43.3(50)    51.8(58)    60.3(62)    67.7(63)    73.8(59)    78.7(52)    81.8(44)

Table 2: The percentage of the data vectors in X* (given by (M/N) x 100) and its computation time for datasets D1-D5



(M/N) x 100% (Computation time in seconds)

Dataset  g = 2^-4     g = 2^-3     g = 2^-2     g = 2^-1     g = 1        g = 2^1      g = 2^2
D6       69.3(19)     70.4(19)     73.4(19)     80.3(14)     83.9(9)      84(8)        87.9(8)
D7       84.4(1077)   84.6(1089)   84.9(1069)   85.6(1085)   86.9(1079)   89.9(1032)   94.7(818)
D8       90(131)      96.6(94)     98.8(78)     99.5(72)     100(70)      100(71)      100(63)
D9       60.8(34)     62.9(36)     67(30)       70.8(21)     72.7(16)     75.2(14)     76.7(15)

Table 3: The percentage of data vectors in X* and its computation time for datasets D6-D9 with e = 10^-3



5.3 Comparison of AESVM to SVM solvers 

To judge the accuracy and efficiency of AESVM, its classification performance was compared with the SMO implementation in LIBSVM, ver. 3.1. We chose LIBSVM because it is a state-of-the-art SMO implementation that is routinely used in similar comparison studies. To compare the efficiency of AESVM to other popular approximate SVM solvers we chose CVM, BVM, LASVM, SVM^perf and RfeatSVM. A description of these methods is given in Section 2. We chose these methods because they are widely cited, their software implementations are freely available and other studies (Shalev-Shwartz et al., 2011) have reported fast SVM training using some of these methods. LASVM is also an efficient method for online SVM training. However, since we do not investigate online SVM learning in this paper, we did not test the online SVM training performance of LASVM. We compared AESVM with CVM and BVM even though they are L2-SVM solvers, as they have been reported to be faster alternatives to SVM implementations such as LIBSVM.

The implementations of AESVM and DeriveRS were built upon the LIBSVM implementation. All methods except SVM^perf were allocated a cache of size 600 MB. The parameters for DeriveRS were $P = 10^5$ and $V = 10^3$, and the first level segregation was performed using FLS2. To reflect a typical SVM training scenario, we performed a grid search with all eighty four combinations of the SVM hyper-parameters $C' = \{2^{-4}, 2^{-3}, \ldots, 2^{6}, 2^{7}\}$ and $g = \{2^{-4}, 2^{-3}, 2^{-2}, \ldots, 2^{1}, 2^{2}\}$. As mentioned earlier, for datasets D2, D3 and D4, five fold cross-validation was performed. The results of the comparison have been split into the sub-sections given below, due to the large number of SVM solvers and datasets used.
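The hyper-parameter grid described above can be enumerated as in the following sketch. The exponent ranges are assumptions chosen to be consistent with the eighty four combinations stated in the text (twelve values of C' times seven values of g), and the conversion C = C' x N rests on the assumed relation C' = C/N; the function name hyper_parameter_grid is hypothetical.

```python
from itertools import product

# Assumed exponent ranges, chosen so that the grid has 84 points.
C_PRIME_GRID = [2.0 ** e for e in range(-4, 8)]   # 12 values: 2^-4 ... 2^7
G_GRID = [2.0 ** e for e in range(-4, 3)]         # 7 values: 2^-4 ... 2^2

def hyper_parameter_grid(n_train):
    """Yield (C, C', g) triples; C = C' * N follows from the assumed C' = C / N."""
    for c_prime, g in product(C_PRIME_GRID, G_GRID):
        yield c_prime * n_train, c_prime, g

grid = list(hyper_parameter_grid(n_train=49990))
```

Since the representative set depends on the kernel space, it only needs to be recomputed once per value of g (seven times here), and can then be reused across all values of C' that share that g.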

5.3.1 Comparison to CVM, BVM, LASVM and LIBSVM 

First we present the results of the performance comparison for D2 in Figures 2 and 3. For ease of representation, only the results of grid points corresponding to combinations of $C' = \{2^{-4}, 2^{-2}, 1, 2^{2}, 2^{4}, 2^{6}\}$ and $g = \{2^{-4}, 2^{-2}, 1, 2^{2}\}$ are shown in Figures 2 and 3. Figure 2 shows the graph between training time and classification accuracy for the five algorithms. Figure 3 shows the graph between the number of support vectors and classification accuracy. We present classification accuracy as the ratio of the number of correct classifications to the total number of classifications performed. Since the classification time of an SVM algorithm is directly proportional to the number of support vectors, we represent it in terms of the number of support vectors. It can be seen that AESVM generally gave more accurate results for a fraction of the training time of the other algorithms, and also resulted in less classification time. The training and classification times of AESVM increased when $\epsilon$ was reduced. This is expected given the inverse relation of M to $\epsilon$ shown in Tables 2 and 3. The variation in accuracy with $\epsilon$ is not very noticeable.

Figures 2 and 3 indicate that AESVM gave better results than the other algorithms for SVM training and classification on D2, in terms of standard metrics. To present a more quantitative and easily interpretable comparison of the algorithms, we define the five performance metrics given below. These metrics combine the results of all runs of each algorithm into a single value, for each dataset. For these metrics we take LIBSVM as a baseline of comparison, as it gives the most accurate solution among the tested methods. Furthermore, an important objective of these experiments is to show the similarity of the results of AESVM and LIBSVM. In the description given below, F can refer to any approximate SVM algorithm such as AESVM, CVM, LASVM etc.

1. Root mean squared error of classification accuracy, RMSE: The similarity of the solution of F to LIBSVM, in terms of its classification accuracy, is indicated by:

$$RMSE = \left( \frac{1}{RS} \sum_{r=1}^{R} \sum_{s=1}^{S} \left( CL_r^s - CF_r^s \right)^2 \right)^{0.5}$$

Here $CL_r^s$ and $CF_r^s$ are the classification accuracy of LIBSVM and F respectively, in the $s$-th cross-validation fold with the $r$-th set of hyper-parameters of grid search.

Figure 2: Plot of training time against classification accuracy of the SVM algorithms on D2



2. Expected training time speedup, ETS: The expected speedup in training time is indicated by:

$$ETS = \frac{1}{RS} \sum_{r=1}^{R} \sum_{s=1}^{S} \frac{TL_r^s}{TF_r^s}$$

Here $TL_r^s$ and $TF_r^s$ are the training times of LIBSVM and F respectively.

3. Overall training time speedup, OTS: It indicates overall training time speedup for the entire grid search with cross-validation, including the time taken to compute the representative set. The total time taken by DeriveRS to compute the representative set for all values of g is represented as $TX^*$. For methods other than AESVM, $TX^* = 0$.

$$OTS = \frac{\sum_{r=1}^{R} \sum_{s=1}^{S} TL_r^s}{\sum_{r=1}^{R} \sum_{s=1}^{S} TF_r^s + TX^*}$$

Figure 3: Plot of classification time, represented by the number of support vectors, against classification accuracy of the SVM algorithms on D2



4. Expected classification time speedup, ECS: The expected speedup in classification time is indicated by:

$$ECS = \frac{1}{RS} \sum_{r=1}^{R} \sum_{s=1}^{S} \frac{NL_r^s}{NF_r^s}$$

Here $NL_r^s$ and $NF_r^s$ are the number of support vectors in the solution of LIBSVM and F respectively.

5. Overall classification time speedup, OCS: The overall speedup in classification time is indicated by:

$$OCS = \frac{\sum_{r=1}^{R} \sum_{s=1}^{S} NL_r^s}{\sum_{r=1}^{R} \sum_{s=1}^{S} NF_r^s}$$
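The five metrics listed above can be computed from per-run results as in the following sketch. The argument layout (nested lists indexed first by hyper-parameter set r and then by cross-validation fold s) and the names comparison_metrics and t_rep_set are assumptions for illustration; the OCS expression follows the reading of that metric as a ratio of total support vector counts, by analogy with OTS.

```python
import math

def _flat(table):
    """Flatten an R x S nested list (r: hyper-parameter set, s: CV fold)."""
    return [v for row in table for v in row]

def comparison_metrics(acc_L, acc_F, time_L, time_F, nsv_L, nsv_F, t_rep_set=0.0):
    """Return RMSE, ETS, OTS, ECS and OCS of solver F against LIBSVM (L).

    t_rep_set is the total time spent computing representative sets; it is
    zero for every method other than AESVM.
    """
    aL, aF = _flat(acc_L), _flat(acc_F)
    tL, tF = _flat(time_L), _flat(time_F)
    nL, nF = _flat(nsv_L), _flat(nsv_F)
    rs = len(aL)
    return {
        "RMSE": math.sqrt(sum((l - f) ** 2 for l, f in zip(aL, aF)) / rs),
        "ETS": sum(l / f for l, f in zip(tL, tF)) / rs,
        "OTS": sum(tL) / (sum(tF) + t_rep_set),
        "ECS": sum(l / f for l, f in zip(nL, nF)) / rs,
        "OCS": sum(nL) / sum(nF),
    }
```

The expected metrics average per-run ratios, so a few runs with very large speedups can dominate them, whereas the overall metrics compare totals and, for training time, also charge the solver for the representative set computation. This is why ETS and OTS can differ by an order of magnitude for the same dataset.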



The results of the classification performance comparison on datasets D1-D5 are shown in Table 4. It was observed that for all tested values of $\epsilon$, AESVM resulted in large reductions in training and classification times when compared to LIBSVM, for a very small difference in classification accuracy. Most notably, for D3 the expected and overall training time speedups were of the order of $10^4$ and $10^3$ respectively, which is outstanding. Comparing the results of AESVM for different $\epsilon$ values, we see that RMSE generally improves (decreases) when $\epsilon$ decreases, while the other metrics improve (increase) when $\epsilon$ increases. The increase in ETS and OTS is of a larger order than the increase in RMSE when $\epsilon$ increases.

Metric         Dataset  AESVM      AESVM      AESVM      CVM      BVM      LASVM
                        e=10^-3    e=10^-4    e=10^-5
RMSE (x10^2)   D1       0.28       0.16       0.21       0.44     0.6      0.12
               D2       2.56       1.81       1.19       26.59    24.06    2.18
               D3       0.16       0.10       0.05       0.33     0.39     55.2
               D4       1.08       0.82       0.74       9.4      9.44     -
               D5       0.99       0.39       0.23       0.74     0.84     0.13
ETS            D1       451.5      145        41.7       8.9      28.6     0.8
               D2       1614.7     289.6      62.8       0.7      0.8      0.2
               D3       28012.3    14799.3    7573.8     60.4     76.8     0.9
               D4       103.1      13.8       3.4        8        6.6      -
               D5       40.2       5          2          0.3      0.5      0.6
OTS            D1       92.1       34.2       9.5        6.2      21.6     0.8
               D2       148.6      45.5       14.3       0.5      0.5      0.1
               D3       968.5      800.6      514.4      23.9     22.8     0.5
               D4       11.9       4.1        2.2        6.2      4.4      -
               D5       5.2        2.5        1.5        0.2      0.3      0.5
ECS            D1       4.8        3.6        2.8        1.2      2        1.1
               D2       35.9       15.5       7.9        4.7      5        1
               D3       48.7       25.8       13.4       0.4      0.6      0.6
               D4       8.4        3.3        1.8        12.4     12.1     -
               D5       4.3        1.9        1.4        0.8      1        1
OCS            D1       3.8        3.1        2.5        1.1      1.9      1
               D2       23.4       10.9       6.1        4.5      4.4      1
               D3       32.2       16.1       9          0.3      0.5      0.2
               D4       5.4        2.7        1.7        12       10.7     -
               D5       2.8        1.8        1.4        0.8      1        1

Table 4: Performance comparison of AESVM (with e = 10^-3, 10^-4, 10^-5), CVM, BVM, LASVM and LIBSVM on datasets D1-D5

Comparing AESVM to CVM, BVM and LASVM, we see that AESVM in general gave the least values of RMSE and the largest values of ETS, OTS, ECS and OCS. In a few cases LASVM gave low RMSE values. However, in all our experiments LASVM took longer to train than the other algorithms including LIBSVM. We could not complete the evaluation of LASVM for D4 due to its large training time, which was more than 40 hours for some hyper-parameter combinations. It was also found that LASVM sometimes resulted in a larger classification time than the other algorithms including LIBSVM. CVM and BVM generally gave high values of RMSE.

Table 4 compares the classification accuracy of CVM, BVM, LASVM and AESVM to the exact SVM solution given by LIBSVM. Another method to compare the algorithms is in terms of the maximum classification accuracy, and the mean and standard deviation of the classification accuracies, without using LIBSVM as a reference point. Such a comparison for datasets D1-D5 is given in Table 5. The five algorithms under comparison were found to give similar maximum classification accuracies except for D2 and D4, where CVM and BVM gave significantly smaller values. Another interesting result is that for D3, the mean and standard deviation of accuracy of LASVM were found to be widely different from the other algorithms. For all the tested values of $\epsilon$, the maximum, mean and standard deviation of the classification accuracies of AESVM were found to be similar.



Accuracy         Dataset  AESVM       AESVM       AESVM       CVM         BVM         LASVM        LIBSVM
                          e=10^-3     e=10^-4     e=10^-5
Maximum          D1       93.4        93.8        93.6        94.1        94.4        94.3         93.9
(x10^2)          D2       77.1        77.2        77.8        70.3        67.1        78.1         78.2
                 D3       99.9        99.9        99.9        99.9        99.9        99.9         99.9
                 D4       68.3        68.3        68.3        63.7        62.3        -            68.2
                 D5       98.7        98.8        98.9        99          99.1        99.2         99
Mean,            D1       92.2, 0.7   92.3, 0.8   92.3, 0.8   92.7, 0.8   92.6, 0.9   92.5, 0.8    92.4, 0.8
standard         D2       72.3, 3.6   73.2, 3.7   73.6, 3.7   52.2, 0.8   54.6, 0.7   73.5, 0.5    74.1, 3.5
deviation        D3       99.8,       99.8, 0.1   99.8, 0.1   99.8, 0.2   99.8, 0.2   69.3, 29.9   99.8, 0.1
(x10^2)          D4       61.3, 3.1   61, 3.1     61, 3.1     55.5, 3.1   54.9, 3.4   -            60.6, 3.2
                 D5       96, 2.5     96.3, 2.6   96.5, 2.6   96.6, 2.5   97, 2       97, 2        96.6, 2.4

Table 5: Comparison of classification accuracies of AESVM (with e = 10^-3, 10^-4, 10^-5), CVM, BVM, LASVM and LIBSVM on datasets D1-D5



Next we present the results of the performance comparison of CVM, BVM, LASVM, AESVM, and LIBSVM on the high-dimensional datasets D6-D9. As described in Section 5.2, DeriveRS was run with only $\epsilon = 10^{-3}$ for these datasets. The results of the performance comparison are shown in Tables 6 and 7. CVM was found to take longer than 40 hours to train on D6, D7 and D8 with some hyper-parameter values and hence we could not complete its evaluation for those datasets. BVM also took longer than 40 hours to train on D7 and hence it was not evaluated for D7. AESVM consistently reported ETS, OTS, ECS and OCS values that are larger than 1, unlike the other algorithms. Similar to the results in Table 4, LASVM and BVM resulted in very large RMSE values for some datasets. The results in Table 7 are similar to Table 5, with similar maximum accuracies for all algorithms and significantly lower mean and higher standard deviation of accuracy for BVM and LASVM on some datasets.

Metric         Dataset  AESVM      CVM      BVM      LASVM
                        e=10^-3
RMSE (x10^2)   D6       0.21       -        7.8      0.85
               D7       1.37       -        -        2.37
               D8       0.02       -        17.55
               D9       0.15       1        0.89     27.5
ETS            D6       1.8        -        0.6      0.8
               D7       1.4        -        -        0.9
               D8       1.1        -        4.7      1
               D9       1.6        1.4      17.5     0.6
OTS            D6       1.5        -        0.6      0.5
               D7       1.2        -        -        0.7
               D8       1.1        -        2.6      0.9
               D9       1.3        1.2      16.9     0.5
ECS            D6       1.2        -        1.5      1
               D7       1.16       -        -        1
               D8       1          -        3.2      1
               D9       1.2        1.8      4.9      2.3
OCS            D6       1.1        -        1.5      1
               D7       1.1        -        -        1
               D8       1          -        2.6      1
               D9       1.1        1.9      5.2      1.1

Table 6: Performance comparison of AESVM (with e = 10^-3), CVM, BVM, LASVM and LIBSVM on datasets D6-D9



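| Accuracy | Dataset | AESVM ($\epsilon = 10^{-3}$) | CVM | BVM | LASVM | LIBSVM |
|---|---|---|---|---|---|---|
| Maximum ($\times 10^2$) | D6 | 85.2 | - | 85.2 | 85 | 85.1 |
| | D7 | 88.3 | - | - | 88.4 | 88.6 |
| | D8 | 99.7 | - | 99.7 | 99.7 | 99.7 |
| | D9 | 99.3 | 99.5 | 99.5 | 99.5 | 99.5 |
| Mean, standard deviation ($\times 10^2$) | D6 | 81.3, 2.8 | - | 80.2, 8.9 | 81.1, 2.9 | 81.4, 2.8 |
| | D7 | 85.3, 5.7 | - | - | 85.2, 6.2 | 85.7, 4.8 |
| | D8 | 92.3, 3.6 | - | 88.5, 18.1 | 92.3, 3.6 | 92.3, 3.6 |
| | D9 | 98.7, 0.8 | 98.9, 0.8 | 98.9, 0.8 | 85.5, 23.9 | 98.8, 0.8 |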

Table 7: Comparison of classification accuracies of AESVM (with $\epsilon = 10^{-3}$), CVM, BVM, LASVM and LIBSVM on datasets D6-D9

5.3.2 Comparison to SVMperf

SVMperf differs from the other SVM solvers in its ability to compute a solution close to the SVM solution for a given number of support vectors ($k$). The algorithm complexity is directly proportional to the parameter $k$, but as $k$ decreases the approximation becomes worse and the difference between the solutions of SVMperf and SVM increases. We used a value of $k = 1000$ for our experiments, as it has been reported to give good performance (Joachims and Yu, 2009). SVMperf was tested on datasets D1, D4, D5, D6, D8 and D9, with the Gaussian kernel and the same hyper-parameter grid as described earlier. The results of the grid search are presented in Table 8. The results of our experiments on AESVM (with $\epsilon = 10^{-3}$) and LIBSVM are repeated in Table 8 for ease of reference. The maximum, mean and standard deviation of classification accuracies are denoted by max. Acc., mean Acc. and std. Acc., respectively.
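The three accuracy summaries used in the tables can be computed from the accuracies obtained over a hyper-parameter grid. The following is a minimal sketch only: the function name `summarize_accuracies` and the sample values are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not state which was used.

```python
from statistics import mean, pstdev


def summarize_accuracies(accuracies):
    """Return (max, mean, std) of grid-search accuracies, scaled by 10^2.

    `accuracies` holds one classification accuracy (a fraction in [0, 1])
    per hyper-parameter combination in the grid.
    """
    percent = [100.0 * a for a in accuracies]
    # Assumption: population standard deviation over the grid points.
    return max(percent), mean(percent), pstdev(percent)


# Hypothetical accuracies for a grid with four hyper-parameter combinations.
grid_accuracies = [0.90, 0.92, 0.94, 0.92]
max_acc, mean_acc, std_acc = summarize_accuracies(grid_accuracies)
print(f"max. Acc. = {max_acc:.1f}, mean Acc. = {mean_acc:.1f}, std. Acc. = {std_acc:.2f}")
```

A large gap between the maximum and the mean, or a large standard deviation, indicates that a solver is sensitive to the choice of hyper-parameters.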



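| Dataset | Solver | RMSE ($\times 10^2$) | ETS | OTS | ECS | OCS | max. Acc. ($\times 10^2$) | mean Acc. ($\times 10^2$) | std. Acc. ($\times 10^2$) |
|---|---|---|---|---|---|---|---|---|---|
| D1 | AESVM | 0.28 | 451.5 | 92.1 | 4.8 | 3.8 | 93.4 | 92.2 | 0.7 |
| | SVMperf | 0.74 | 3.7 | 0.9 | 6.8 | 6.8 | 94 | 92.7 | 0.5 |
| | LIBSVM | | | | | | 93.9 | 92.4 | 0.8 |
| D4 | AESVM | 1.08 | 103.1 | 11.9 | 8.4 | 5.4 | 68.3 | 61.3 | 3.1 |
| | SVMperf | 2.14 | 3.1 | 1.2 | 186.8 | 186.8 | 68.1 | 61.8 | 2.7 |
| | LIBSVM | | | | | | 68.2 | 60.6 | 3.2 |
| D5 | AESVM | 0.99 | 40.2 | 5.2 | 4.3 | 2.8 | 98.7 | 96 | 2.5 |
| | SVMperf | 0.26 | 0.2 | 0.1 | 5.8 | 5.8 | 99 | 96.7 | 2.4 |
| | LIBSVM | | | | | | 99 | 96.6 | 2.4 |
| D6 | AESVM | 0.21 | 1.8 | 1.5 | 1.2 | 1.1 | 85.2 | 81.3 | 2.8 |
| | SVMperf | 9.39 | 1.1 | 0.9 | 20 | 20 | 85.2 | 79.6 | 10.7 |
| | LIBSVM | | | | | | 85.1 | 81.4 | 2.8 |
| D8 | AESVM | 0.02 | 1.1 | 1.1 | 1 | 1 | 99.7 | 92.3 | 3.6 |
| | SVMperf | 54.2 | 37.6 | 23.8 | 49 | 49 | 99.9 | 55.7 | 42.3 |
| | LIBSVM | | | | | | 99.7 | 92.3 | 3.6 |
| D9 | AESVM | 0.15 | 1.6 | 1.3 | 1.2 | 1.1 | 99.3 | 98.7 | 0.8 |
| | SVMperf | 22.6 | 1.2 | 0.9 | 21.3 | 21.3 | 99.2 | 86.1 | 18.8 |
| | LIBSVM | | | | | | 99.5 | 98.8 | 0.8 |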

Table 8: Performance comparison of SVMperf, AESVM (with $\epsilon = 10^{-3}$), and LIBSVM

SVMperf was found to generally give higher RMSE values than AESVM. In particular, for the high dimensional datasets (D6, D8 and D9), the RMSE values were significantly higher. The training speedup values of SVMperf are much lower than those of AESVM, except for D8. As expected, the classification time speedups of SVMperf are significantly higher than those of AESVM. The maximum accuracies of all the algorithms were similar. However, the mean and standard deviation of accuracies of SVMperf were very different from those of AESVM and LIBSVM for the high dimensional datasets D6, D8 and D9.

12. We used the software parameters '-t 2 -w 9 -i 2 -b -k 1000', as suggested on the author's website.

5.3.3 Comparison to RfeatSVM 



Rahimi and Recht (2007) proposed a promising method to approximate non-linear kernel SVM solutions using simpler linear kernel SVMs. This is accomplished by first projecting the training dataset into a randomized feature space and then using any SVM solver with the linear kernel on the projected dataset. We concentrated our experiments on investigating the accuracy of the RfeatSVM solution and its similarity to the SVM solution. LIBSVM with the linear kernel was used to compute the RfeatSVM solution on the projected datasets. We used LIBSVM, in spite of the availability of faster linear SVM implementations, as it is an exact SVM solver. Hence, only the performance metrics related to accuracy were used to compare AESVM, LIBSVM and RfeatSVM. The random Fourier features method, described in Algorithm 1 of Rahimi and Recht (2007), was used to project the datasets D1, D5, D6 and D9 into a randomized feature space of dimension $E$. The results of the accuracy comparison are given in Table 9. We used a smaller hyper-parameter grid
of all twenty four combinations of C = {2-^, 2'^, 1, 22,2"^, 2^} and g = {2'^, 2-^, 1, 2^} for 
our experiments. The results reported in Table 9 for AESVM and LIBSVM were computed
for this smaller grid.
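The projection step described above can be sketched as follows. This is an illustrative sketch of random Fourier features for a Gaussian kernel, not the code used in the experiments. It assumes the kernel has the form $k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2)$; the names `make_rff_map`, `num_features` (playing the role of $E$) and `gamma` are hypothetical.

```python
import math
import random


def make_rff_map(input_dim, num_features, gamma, seed=0):
    """Build a random Fourier feature map for exp(-gamma * ||x - y||^2).

    Each feature is sqrt(2 / E) * cos(w . x + b), with w drawn from a
    Gaussian with standard deviation sqrt(2 * gamma) per coordinate and
    b drawn uniformly from [0, 2 * pi].
    """
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, sigma) for _ in range(input_dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def project(x):
        # Map one input vector to its randomized feature representation.
        return [scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return project


# The inner product of two projected vectors approximates the kernel value.
x, y = [0.1, 0.2], [0.3, -0.1]
project = make_rff_map(input_dim=2, num_features=2000, gamma=1.0)
approx = sum(a * b for a, b in zip(project(x), project(y)))
exact = math.exp(-1.0 * sum((a - b) ** 2 for a, b in zip(x, y)))
print(f"approximate kernel = {approx:.3f}, exact kernel = {exact:.3f}")
```

A linear SVM trained on the projected vectors then stands in for the non-linear kernel SVM. The approximation error of the inner product shrinks as the number of random features grows.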

We used the same number of dimensions ($E$) of the randomized feature space for D1 and D6 as in Rahimi and Recht (2007). The RMSE values for RfeatSVM were significantly higher than those of AESVM for most datasets, especially for D1 and D6. The maximum accuracy of RfeatSVM was lower than that of AESVM and LIBSVM for all datasets, most markedly for D1 (37.8 versus 93.6). The time taken to compute the randomized feature space is not reported because it was found to be negligibly small. Another important observation was that the projected datasets were found to be almost 100% dense. The training time of SVM solvers is typically linearly proportional to the density of the dataset, and hence a highly dense dataset can take a significant training time even with fast linear SVMs. Dense datasets also have large memory requirements.
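The density observation can be illustrated with a small sketch. It assumes that density means the percentage of non-zero entries in the data matrix, and it uses toy cosine features with hypothetical fixed weights and offsets in place of a real random projection.

```python
import math


def density_percent(rows):
    """Percentage of non-zero entries in a list of equal-length rows."""
    total = sum(len(row) for row in rows)
    nonzero = sum(1 for row in rows for value in row if value != 0.0)
    return 100.0 * nonzero / total


# A sparse toy dataset: only 2 of the 8 entries are non-zero.
sparse_rows = [[0.0, 0.0, 1.5, 0.0],
               [0.0, 2.0, 0.0, 0.0]]

# Hypothetical projection parameters (three cosine features).
weights = [[0.3, -0.7, 0.2, 0.5],
           [1.1, 0.4, -0.6, 0.9],
           [-0.8, 0.1, 0.7, -0.2]]
offsets = [0.4, 1.3, 2.2]

projected_rows = [
    [math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
     for w, b in zip(weights, offsets)]
    for x in sparse_rows
]

print(density_percent(sparse_rows))     # density before projection
print(density_percent(projected_rows))  # density after projection
```

A cosine feature is zero only when its argument is an odd multiple of $\pi/2$, which almost never happens for randomly drawn parameters, so the sparsity of the input is lost after projection.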



5.4 Performance with the polynomial kernel 

To validate our proposal of AESVM as a fast alternative to SVM for all non-linear kernels,
we performed a few experiments with the polynomial kernel, $k(\mathbf{x}_1, \mathbf{x}_2) = (1 + \mathbf{x}_1^T \mathbf{x}_2)^d$. The
hyper-parameter grid composed of all twelve combinations of C" = {2~^, 2~'^, 1, 2^} and 
d = {2, 3, 4} was used to compute the solutions of AESVM and LIBSVM on the datasets 
D1, D4 and D6. The results of the computation of the representative set using DeriveRS
are shown in Table 10 The parameters for DeriveRS were P = 10^, V = 10^ and e = 10"'^, 



and the first level segregation was performed using FLS2. The performance comparison of AESVM and LIBSVM with the polynomial kernel is shown in Table 11. As in the case of the Gaussian kernel, we found that AESVM gave results similar to LIBSVM with the polynomial kernel, while taking shorter training and classification times.
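For reference, the polynomial kernel used in these experiments, $k(\mathbf{x}_1, \mathbf{x}_2) = (1 + \mathbf{x}_1^T \mathbf{x}_2)^d$, can be evaluated as in the sketch below. The function names and the sample points are hypothetical; only the degrees $d = 2, 3, 4$ are taken from the text.

```python
def polynomial_kernel(x1, x2, degree):
    """Evaluate the polynomial kernel (1 + x1 . x2) ** degree."""
    inner = sum(a * b for a, b in zip(x1, x2))
    return (1.0 + inner) ** degree


def gram_matrix(points, degree):
    """Kernel (Gram) matrix of all pairs of points for a given degree."""
    return [[polynomial_kernel(p, q, degree) for q in points] for p in points]


# Hypothetical data vectors; one Gram matrix per degree used in the grid.
points = [[1.0, 2.0], [3.0, 4.0]]
grams = {d: gram_matrix(points, d) for d in (2, 3, 4)}

for d, gram in grams.items():
    print(d, gram)
```

Kernel values grow quickly with $d$ when the inner products are large, which is one reason input features are usually scaled before a polynomial kernel is applied.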
| Dataset | Solver | RMSE ($\times 10^2$) | max. Acc. ($\times 10^2$) | mean Acc. ($\times 10^2$) | std. Acc. ($\times 10^2$) | Original density % | Density after projection % |
|---|---|---|---|---|---|---|---|
| D1 | AESVM | 0.24 | 93.6 | 92.2 | 0.9 | | |
| | RfeatSVM ($E = 100$) | 56.18 | 37.8 | 36.1 | 1.3 | 33 | 100 |
| | LIBSVM | | 93.6 | 92.3 | 0.9 | | |
| D5 | AESVM | 0.9 | 98.6 | 95.7 | 2.8 | | |
| | RfeatSVM ($E = 100$) | 5.3 | 94.7 | 91.6 | 1.4 | 59 | 100 |
| | LIBSVM | | 98.9 | 96.2 | 2.7 | | |
| D6 | AESVM | 0.16 | 85.1 | 81.2 | 2.9 | | |
| | RfeatSVM ($E = 1000$) | 4 | 81.6 | 78 | 2.2 | 11 | 100 |
| | LIBSVM | | 85 | 81.3 | 3 | | |
| D9 | AESVM | 0.15 | 99.3 | 98.6 | 0.8 | | |
| | RfeatSVM ($E = 1000$) | 0.6 | 98.7 | 97.4 | 0.6 | 4 | 95.8 |
| | LIBSVM | | 99.5 | 98.8 | 0.9 | | |


Table 9: Performance comparison of RfeatSVM, AESVM (with e = 10"^), and LIBSVM. 
The density of the datasets before and after projecting into randomized feature 
spaces are also shown 



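Representative set size as a percentage of the training set (computation time in seconds)

| Dataset | d = 2 | d = 3 | d = 4 |
|---|---|---|---|
| D1 | 6.6 (410) | 14.2 (1329) | 22.5 (3696) |
| D4 | 30.3 (752) | 57.7 (1839) | 76.5 (2246) |
| D6 | 69 (20) | 69.7 (21) | 70.4 (22) |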


Table 10: Results of DeriveRS for the polynomial kernel 



6. Discussion 

AESVM is a new problem formulation that is almost identical to, but less complex than, the
SVM primal problem. AESVM optimizes over only a subset of the training dataset called
the representative set and, consequently, is expected to give fast convergence with most
SVM solvers. In contrast, the other studies mentioned in Section 2 are mostly algorithms
that solve the SVM primal or related problems. Methods such as RSVM also use different
problem formulations. However, they require special algorithms to solve, unlike AESVM.
In fact, AESVM can be solved using many of the methods in Section 2. As described in
Corollary 5, there are some similarities between AESVM and the Gram matrix approxi- 
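
| Dataset | Solver | RMSE ($\times 10^2$) | ETS | OTS | ECS | OCS | max. Acc. ($\times 10^2$) | mean Acc. ($\times 10^2$) | std. Acc. ($\times 10^2$) |
|---|---|---|---|---|---|---|---|---|---|
| D1 | AESVM | 0.15 | 31.2 | 2 | 3.1 | 3.1 | 94 | 93.5 | 0.4 |
| | LIBSVM | | | | | | 94.1 | 93.5 | 0.4 |
| D4 | AESVM | 2.04 | 3.3 | 1.5 | 2 | 1.9 | 64.3 | 60.8 | 2.5 |
| | LIBSVM | | | | | | 64.5 | 60.7 | 2.5 |
| D6 | AESVM | 0.6 | 2.7 | 1.9 | 1.5 | 1.5 | 84.5 | 80.5 | 2.5 |
| | LIBSVM | | | | | | 84.6 | 81 | 2.3 |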

Table 11: Performance comparison of AESVM (with e = 10 ^), and LIBSVM with the 
polynomial kernel 

Figure 4: Plot of RMSE values for all SVM solvers



mation methods discussed earlier. It would be interesting to see a comparison of AESVM, 



with the core set based method proposed by Gärtner and Jaggi (2009). However, due to the lack of an available software implementation and of published results on L1-SVM with non-linear kernels using their approach, the authors find such a comparison study beyond the scope of this paper.

Figure 5: Plot of maximum classification accuracy for all SVM solvers

The theoretical and experimental results presented in this paper demonstrate that the solutions of AESVM and SVM are similar in terms of the resulting classification accuracy. A summary of the experiments in Section 5 that compared an SMO based AESVM implementation, CVM, BVM, LASVM, LIBSVM, SVMperf and RfeatSVM is presented in Figures 3 to 7. It can be seen that AESVM typically gave the lowest approximation error (RMSE), while giving the highest overall training time speedup (OTS). AESVM also gave competitively high overall classification time speedup (OCS) in comparison with the other algorithms except SVMperf. It was found that the maximum classification accuracies of all the algorithms except RfeatSVM were similar. RfeatSVM, and in some cases CVM and BVM, gave lower maximum classification accuracies. Though the results of RfeatSVM illustrated in Figures 4 and 5 were computed for a smaller hyper-parameter grid (refer to Section 5.3.3), we believe they indicate the overall performance of the method. Apart from the excellent experimental results for AESVM with the Gaussian kernel, AESVM also gave good results with the polynomial kernel, as described in Section 5.4.



Figure 6: Plot of overall training time speedup (compared to LIBSVM) for all SVM solvers

The algorithm DeriveRS was generally found to be efficient, especially for the lower dimensional datasets D1-D5. For the high dimensional datasets D6-D9, the representative set was almost the same size as the training dataset, resulting in small gains in training and classification time speedups for AESVM. In particular, for D8 (MNIST dataset) the representative set computed by DeriveRS was almost 100% of the training set. A similar result was reported for this dataset in Beygelzimer et al. (2006), where a divide and conquer method was used to speed up nearest neighbor search. Dataset D8 is reported to have resulted in nearly no speedup, compared to a speedup of almost one thousand for other datasets when their method was used. Their analysis found that the data vectors in D8 were very distant from each other in comparison with the other datasets (see footnote 13). This observation can explain the performance of DeriveRS on D8, as data vectors that are very distant from each other are expected to have large representative sets. It should be noted that, irrespective of the dimensionality of the datasets, AESVM always resulted in excellent performance in terms of classification accuracy. There seems to be no relation between dataset density and the performance of DeriveRS and AESVM.

13. This is indicated by the large expansion constant for D8 illustrated in Beygelzimer et al. (2006).
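The remark about distant data vectors can be made concrete for the Gaussian kernel. The sketch below is an illustration only and is not part of DeriveRS. It assumes a kernel of the form $k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2)$, for which every vector has unit norm in kernel space and $\|\phi(\mathbf{x}) - \phi(\mathbf{y})\|^2 = 2 - 2k(\mathbf{x}, \mathbf{y})$. The function names and sample points are hypothetical.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_space_distance(x, y, gamma):
    """Distance between the images of x and y in kernel space.

    Uses ||phi(x) - phi(y)||^2 = k(x, x) + k(y, y) - 2 k(x, y) = 2 - 2 k(x, y).
    """
    return math.sqrt(max(0.0, 2.0 - 2.0 * gaussian_kernel(x, y, gamma)))


close_pair = ([0.0, 0.0], [0.1, 0.0])
far_pair = ([0.0, 0.0], [5.0, 0.0])

d_close = kernel_space_distance(*close_pair, gamma=1.0)
d_far = kernel_space_distance(*far_pair, gamma=1.0)
print(f"close pair: {d_close:.4f}, far pair: {d_far:.4f}")
```

When all pairs of vectors are far apart in input space, their images are nearly orthogonal and sit close to the largest possible separation of $\sqrt{2}$. It is then plausible that few vectors can be approximated well by convex combinations of the others, which is consistent with the large representative set observed for D8.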

The authors will provide the software implementation of AESVM and DeriveRS upon 
request. Based on the presented results, we suggest the parameters e = 10~^, P = 10^ 
and V = 10'^ for DeriveRS. A possible extension of this paper is to apply the idea of the 
representative set to other SVM variants and to support vector regression (SVR). It is 
straightforward to see that the theorems in Section 3.2 can be extended to SVR. It would
be interesting to investigate AESVM solvers implemented using methods other than SMO.
Modifications to DeriveRS using the methods in Section 2 might improve its performance
on high dimensional datasets. The authors will investigate improvements to DeriveRS and
the application of AESVM to the linear kernel in their future work.